1. INTRODUCTION


There are indications that progress toward higher chip density, lower cost and 
greater reliability of microprocessors and memories will continue into the next decade. 
This trend naturally leads to the design of multiprocessors that are faster and more reliable than
their uniprocessor counterparts. However, use of multiple microprocessors to speed up
general-purpose computations requires the solution of such important problems as task 
partitioning, interconnection/intercommunication, synchronization, reliability, I/O inter- 
face and handling, software structure and programmability, etc. The efficacy of the mul- 
tiprocessor depends crucially on the application tasks that it executes, and no single mul- 
tiprocessor can at present embody the optimal solution to the above issues for general- 
purpose computations. Consequently, it has been the general tendency to develop 
special-purpose multiprocessors. One such example is real-time multiprocessors whose 
primary function is control of critical real-time systems, e.g. aircraft, spacecraft, nuclear 
reactor, power distribution and monitoring, etc. Use of multiple processors/memories for 
real-time control is motivated by its potential for high operating speed and improved 
reliability through component multiplicity [1], [2]. 1

A real-time control system comprises two components: a controlled process and a
computer controller. Despite their synergistic relationship, these two components have
been designed and analyzed in isolation: the former by control scientists and
the latter by computer designers. Moreover, computer controller design has usually
relied on ad hoc or empirical methods, whereas there has been significant progress in
the theory and design of controlled processes. In order to narrow this gap and provide a
bridge between these two components, this report considers the controller with the
controlled processes taken into account.

1 It must, of course, be pointed out that it is hideously easy to develop multiprocessors that actually
perform less efficiently than their uniprocessor counterparts.


A computer controller has three communicating functions: data acquisition, data
processing and output. The data acquisition function is responsible for gathering input
(feedback) data from sensors, input panels and other associated equipment; the processing
function, carried out by the computer (in our case the multiprocessor), generates output
control/display signals from the input data; and the output function sends the processed
results to mechanical actuators, displays and other output devices. The system may
thus logically be regarded as a three-stage pipe.

The controller software in the processing section consists of a set of tasks, each of
which corresponds to some job to be performed repetitively in response to particular sets
of environmental stimuli. 2 The set of tasks to be executed by the controller is predetermined,
and the stochastic nature and behavior of the software are known in advance, at
least in outline, to the designer. This fact makes it both easier and more necessary to
obtain a reasonably good performance analysis of the system.

The determining characteristic of a real-time multiprocessor’s performance is a 
combination of reliability and high throughput. The throughput requirements arise from 
the need for quick system response to environmental stimuli. Speed is of the essence in a 
real-time controller since failure can occur not only through massive hardware failures in 
the system, but also on account of the system’s not responding fast enough to events in 
the environment. 

As a result of these special performance requirements, performance measures used
to characterize general-purpose uniprocessor systems are no longer appropriate for
real-time multiprocessors. Conventional throughput, reliability, and availability by
themselves have little meaning in the context of control; a suitable combination of
these is necessary. New performance measures are required: measures that are congruent
to the application, permit the expression of specifications that reflect without contortion
true system characteristics and application requirements, and allow an
objective comparison of rival systems for particular applications.

2 These include both regular task triggers according to a predetermined schedule as well as unexpected,
situation-dependent task triggers.

We cannot stress too heavily that it is meaningless to speak of the performance of a 
computer out of the context of its application. The form the performance measures take 
must reflect the needs of the application, and the computer system must be modeled 
within this context. The multiprocessor controller and the controlled process form a syn- 
ergistic pair, and any effort to study the one must take account of the needs of the 
other. 

It is important that performance measures should depend on variables that can be 
definitively estimated or objectively measured. It is our policy in this report, therefore, 
to always base performance indices on experimentally-measurable quantities, i.e., con- 
troller response times for the various system tasks. 

It must also be realized that there is a distinction to be drawn between the meas- 
urement of performance parameters and their interpretation. In parameter measurement, 
we are concerned, for example, with the ease and accuracy with which the parameter can 
be measured. On the other hand, the interpretation consists of a procedure to integrate 
the results of the measurement into a complete picture of the computer’s performance. 
Note here that the different parameter values (reliability, throughput, etc.) can depend 
on one another in a quite complex way: we do not have the luxury of assuming that they 
are independent of each other. We have either to present computer performance as a 
vector (which makes comparison between different systems difficult) or to derive an
objective metric for the performance vector. It is the purpose of our research to develop
one such metric and then use it for design and analysis of real-time computers. 

Performance measures that partially meet real-time requirements have been sug- 
gested by the following authors. Beaudry [3] considers measures emanating from the 
volume of computation from a computer system over a given period of operation. Mine 
and Hatayama [4] consider job-related reliability, by which they mean the probability 
that the system will successfully complete a certain job. Huslende [5] attempts to be as 
general as possible, and presents what amounts to a re-statement of Markov modeling 
with traditional measures. Chou and Abraham [6] present performance-availability 
models, Castillo and Siewiorek [7] performance-reliability models. Osaki and Nishio [8] 
consider the "reliability of information”, by which they mean the expected percentage of 
wrong outputs per unit time in steady state. 

All these measures consider the computer system in isolation, i.e. without explicit 
regard to the requirements of the operating environment. For this reason, they are quite 
unsuitable for use in real-time control situations. 

Among all existing performance measures, Meyer’s performability [9] seems to meet, 
though in an abstract form, the real-time requirements discussed above. His measure 
explicitly links the application with the computer by listing "accomplishment levels”, 
which are expressions of how well the computer has performed within the context of the 
application. His work focuses on the development of a framework for modeling and per- 
formance evaluation, rather than on methodology for deriving the performance measures 
themselves. No guidelines are given for appropriately specifying the accomplishment lev- 
els: what is provided is a set of mathematical tools for their computation, once they have 
been defined. Some recent work by Meyer [10] continues this trend, developing the
theory of stochastic Petri nets.


By contrast we focus, in this report, on presenting a methodology for objectively
characterizing and determining controller performance. If one wished to translate our
work into the terms of Meyer’s performability, we show how to derive a set of uncountably
many accomplishment levels that are completely objective and capable of definitive
estimation and/or measurement. It is this that makes our measures complementary to
that of Meyer.

The next step is to apply the performance measures to the design and analysis of
real-time computers. For example, one should be able to answer a fundamental design
question: "What is the optimum redundancy to be built into real-time systems?" As will be seen
later, increasing component redundancy beyond a certain point becomes detrimental to
the dynamic reliability of real-time systems. The performance measures can be used as 
(i) criteria for architectural design of real-time computers and (ii) objective tools for 
evaluating and comparing rival computers. 

This report is organized as follows. In Section 2, real-time controlled systems are 
discussed, and Section 3 introduces our performance measures. Section 4 contains two 
examples to show how to determine the performance measures; one is an idealized 
motion control problem and the other is a more realistic example, the aircraft landing 
problem. In Section 5, we explore two important applications of the performance meas- 
ures to the design and analysis of real-time computers — the number-power tradeoff and 
synchronization and fault-masking. This report concludes with Section 6. 

2. REAL-TIME SYSTEMS 

Figure 1 shows the block diagram of a typical real-time control system.

Figure 1. A Typical Real-Time Control System.

The inputs to the control computer come from sensors that provide data about the
controlled process, and from the environment. These data are typically fed to the control
computer at regular intervals. Data rates are usually low: generally fewer than 20 words a
second for each sensor. The job list represents the fact that all the control software is
pre-determined and partitioned into individual jobs.

Central to the operation of the system is the trigger generator that initiates execu- 
tion of one or more of the control programs. In most systems, this is physically part of 
the controller itself, but we separate them here for purposes of clarity. Triggers can be 
classed into three categories. 

(1) Time-generated trigger: These are generated at regular intervals, and lead to the 
corresponding controller job(s) being initiated at regular intervals. In control 
theoretic terms, these are open-loop triggers. 

(2) State-generated trigger: These are closed-loop triggers, generated whenever the sys- 
tem is in a particular set of states. A simple example is a thermostat that switches 
on or off according to the ambient temperature. For practicality, it might be 
necessary to space these triggers by more than a specified minimum duration. If 
time is to be regarded as an implicit state variable, the time-generated trigger is a 
special case of the state-generated trigger. One can also have combinations of the 
two. 

(3) Operator-generated trigger: The operator can generally over-ride the automatic sys- 
tems, generating and cancelling triggers at will. 
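
The first two trigger categories can be illustrated with a minimal Python sketch. It is not taken from the report: the class name StateTrigger, the function time_trigger_instants, and the threshold-style firing condition are all hypothetical, chosen only to show a closed-loop trigger whose firings are spaced by a minimum duration, alongside an open-loop trigger generated at regular intervals.

```python
from dataclasses import dataclass


@dataclass
class StateTrigger:
    """State-generated (closed-loop) trigger with a minimum spacing.

    Assumption: the 'particular set of states' is modeled as
    'measured value above a threshold', as in a simple thermostat.
    """
    threshold: float
    min_spacing: float
    last_fired: float = float("-inf")

    def poll(self, now: float, value: float) -> bool:
        """Return True if a trigger is generated at time `now`."""
        in_trigger_set = value > self.threshold
        spaced_enough = now - self.last_fired >= self.min_spacing
        if in_trigger_set and spaced_enough:
            self.last_fired = now
            return True
        return False


def time_trigger_instants(period: float, horizon: float) -> list:
    """Time-generated (open-loop) trigger instants in [0, horizon)."""
    instants = []
    n = 0
    while n * period < horizon:
        instants.append(n * period)
        n += 1
    return instants
```

An operator-generated trigger would simply bypass both mechanisms; it is omitted here because the text gives no rule for it.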

The output of the controller is fed to the actuators and/or the display panel(s). 
Since the actuators are mechanical devices and the displays are meant as a human
interface, the data rates here are usually very low. Indeed, a control-computer system
generally exhibits a fundamental dichotomy from many points of view. Firstly, the I/O
is carried out at rather low rates, 3 whereas the computations have to be carried out at very
high rates owing to real-time constraints on control. Secondly, the complexity of the
data processing carried out at the sensors and the actuators is much less than that car- 
ried out in the main data-processing area. Thirdly, the sensors, actuators, and the asso- 
ciated equipment are entirely dedicated to the performance of a particular set of tasks, 
while the hardware in the region where the complex data processing takes place is usu- 
ally not dedicated. 

It is therefore possible to logically partition real-time computer systems into central 
and peripheral areas. The peripheral area consists of the sensors, actuators, displays, 
and the associated processing elements used for the pre-processing and formatting of 
data that are to be put into the central area, and the "unpacking" of data that are put
out from the central area to the actuators and/or displays. The central area consists of 
the processors and associated hardware where all the higher-level computation takes 
place. Designing the peripheral area is relatively straightforward; the most difficult 
design problems that arise in these systems usually concern the central area. Figure 2 
and Table 1 emphasize these points. 

3 The only exceptions to this that we know of are control systems that depend on real-time
image-processing. Such applications have extremely high input data rates.

Figure 2. Schematic Decomposition of a Real-Time Control Computer.

Peripheral Area                          Central Area
-----------------------------------      ---------------------------------------
Low baud rates                           High baud rates
Complete dedication                      Complete generality of function
Low-capability processors                High-capability processors
Simple interconnection structure         Complex interconnection structure
Almost totally decoupled processors      Processors highly coupled in many cases
Trivial executive software               Complex executive software

Table 1. Difference between Central and Peripheral Areas.

A control system executes "missions." These are periods of operation between successive
periods of maintenance. In the case of aircraft, a mission is usually a single
flight. The operating interval can sometimes be divided into consecutive sections
that can be distinguished from each other. These sections are called phases. For example,
Meyer et al. [11] define the following four distinct phases in the mission lifetime of a
civilian aircraft:


(a) Takeoff/cruise until VHF Omnirange (VOR)/Distance Measuring Equipment 
(DME) out of range. 

(b) Cruise until VOR/DME in range again. 

(c) Cruise until landing is to be initiated. 

(d) Landing. 

The current phase of the controller partially determines its job load, job mix, job 
priorities, and so on. 

A real-time system typically has to function under more constraints than its 
general-purpose counterpart. Firstly, there are hard deadlines, which, if missed, can lead
to catastrophic failure. Timing is therefore crucial to job execution. Secondly, there are 
physical constraints that are not quite so restricting for the general-purpose computer. 
Examples are weight and power consumption. 

The applications software has the following properties. 

(1) The interaction between individual processes is minimal. 

(2) The effects of processes upon one another are well understood.

(3) Clear lines of authority are recognized. 

(4) Clear lines of information flow are recognized. 

(5) The products of the process are well defined. 

These are precisely the five conditions for efficiency in a distributed system as listed by 
Fox [12]. Because of this and also due to their potentially high reliability, distributed 
systems are particularly suited to real-time use. Also, the problems that arise when one 
attempts to partition programs in general-purpose applications for implementation on a
distributed computer do not usually arise in the real-time context. The software for a
control computer is not so much a single partitionable package, as a set of cleanly 
interacting subroutines. Macro-instruction languages show much promise in this con- 
text [13]. 

The constraints on real-time systems as well as the properties of the applications 
software have a very great influence on the system architecture and the executive 
software. 

The overall computer system has to be much more reliable than any of its components,
so fault-tolerance is essential. Massive replication of hardware is commonplace,
as are high interconnection-link bandwidths. The system must be as symmetric
as possible so that reconfiguration is easy.

The nature of the executive software must reflect constraints on time and 
resources. The executive is responsible for the control of queues at shared resources, for 
the scheduling of events, for the handling of interrupts, and the allocation of memory. 
While all these tasks are common to general-purpose systems, the existence of hard 
deadlines makes the efficient execution of such activities imperative. The designer of the 
real-time system does not have the luxury of assuming that occasional serious degrada- 
tion of performance is acceptable, if unfortunate. 

An additional important task of the executive is fault-handling and recovery. This 
includes reconfiguration where that is possible, and the reallocation of tasks upon failure. 
Here again, the constraints on time make this a difficult problem. 


3. THE PERFORMANCE MEASURES 


3.1. Terminology and Notation 

A real-time computer executes pre-defined control jobs repeatedly, upon environmental
or other stimuli. A job is a well-defined stretch of software, e.g., a subroutine.
Each job maps into one or more tasks. The mapping is determined by the current state
of the controlled process. This is further clarified later in this Section. System response
time is defined as the time between the initiation/triggering of a control job and the
actuator and/or display output that results. This quantity is the sum of controller
response time and actuation time. Environmental or other occurrences trigger the tasks,
a unique version being created as a result. This is said to be an extant version as long as
it continues to execute in the system. Versions of task $i$ are denoted by $V_{ij}$, which
represents the $j$-th execution of task $i$. Denote the response time of a no-longer-extant
version $V_{ij}$ by $\mathrm{RESP}(V_{ij})$. The response time of an extant version is undefined. The
extant time of a version triggered at time $\tau_{ij}$ when the system is in state $n_{ij}$ is given
by $\Xi(V_{ij}, \tau_{ij}, n_{ij}, t) = \min(t - \tau_{ij}, \mathrm{RESP}(V_{ij}))$. 4

A controller task is said to be critical if it has an associated hard deadline [14], 
which if exceeded by any of its versions, results in catastrophic or dynamic failure. Hard 
deadlines do not exist for non-critical tasks. 

Ordinarily, repair to the real-time computer is not allowed while the computer is in
operation. In this connection, we define the mission lifetime as the duration of operation
between successive stages of service. We let the mission lifetime be a random variable
with probability distribution function $L(t)$. At the beginning of a mission (i.e. immediately
after service), a system is assumed to be free of faults.

4 This does not imply that no state changes occur during the course of a task execution. $\mathrm{RESP}(V_{ij})$ is
an implicit function of $n_{ij}$.


3.2. Definition of the Performance Measures 


Our performance measures are all based on the extant and response times. For critical
tasks $i$, with hard deadlines $t_{di}$, the cost function is defined by:

$$ C_i(\Xi) = \begin{cases} g_i(\Xi) & \text{if } 0 \le \Xi \le t_{di}, \\ \infty & \text{if } \Xi > t_{di}, \end{cases} \qquad (1) $$

where $g_i$ is called the finite cost function of task $i$, and is only defined in the interval
$[0, t_{di}]$, and we omit for notational convenience the arguments of $\Xi$, the extant time.

For non-critical tasks, the same definition for the cost function can be used, with 
the associated hard deadline set at infinity. 
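
The definitions of the extant time and of the cost function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names extant_time and make_cost are hypothetical, the finite cost function g is supplied by the caller, and a still-extant version is represented by passing no response time.

```python
import math


def extant_time(t, trigger_time, response_time=None):
    """Extant time of a version at time t: min(t - tau, RESP).

    response_time=None models a version that is still extant,
    whose response time is undefined.
    """
    if t < trigger_time:
        raise ValueError("version has not been triggered yet")
    elapsed = t - trigger_time
    if response_time is None:
        return elapsed
    return min(elapsed, response_time)


def make_cost(g, hard_deadline=math.inf):
    """Build the cost function of a task from its finite cost function g.

    hard_deadline=math.inf gives the non-critical case, where the
    cost never becomes infinite.
    """
    def cost(xi):
        if xi < 0:
            raise ValueError("extant time cannot be negative")
        return g(xi) if xi <= hard_deadline else math.inf
    return cost
```

For instance, make_cost(lambda xi: 2 * xi, hard_deadline=1.0) yields a cost that grows linearly up to the deadline and is infinite beyond it.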

Let $q_i(t)$ denote the number of times task $i$ is initiated in the interval $(0, t)$. Then,
the cumulative cost function for the task $i$ is defined as

$$ \Gamma_i(t) = \sum_{j=1}^{q_i(t)} C_i\big(\Xi(V_{ij}, \tau_{ij}, n_{ij}, t)\big) \qquad (2) $$

and the system cost function is defined as

$$ S(t) = \sum_{i=1}^{r} \Gamma_i(t) \qquad (3) $$

where $r$ is the number of tasks in the system. Both $\Gamma_i$ and $S$ are clearly defective random
variables. Our performance measures are then given by:


Cost Index, $$ K(x) = \int_0^\infty \mathrm{Prob}\{S(t) \le x\}\, dL(t) \qquad (4) $$

Probability of dynamic failure, $$ p = \int_0^\infty \mathrm{Prob}\{S(t) = \infty\}\, dL(t) \qquad (5) $$

Mean Cost, $$ M = \int_0^\infty E\{S(t) \mid \text{no hard deadlines are missed}\}\, dL(t) \qquad (6) $$

Variance Cost, $$ V = \int_0^\infty \mathrm{Var}\{S(t) \mid \text{no hard deadlines are missed}\}\, dL(t) \qquad (7) $$

where $E\{\cdot \mid \cdot\}$ and $\mathrm{Var}\{\cdot \mid \cdot\}$ represent conditional expectation and variance, respectively.

The probability of dynamic failure subsumes the traditional probability of failure (called 
here for distinction the probability of static failure ) since the latter can be viewed as the 
probability that the expected system response time is infinity. Clearly, in the case of 
non-critical tasks, the probabilities of static and of dynamic failure are equal. 

The following auxiliary measures are useful when one focuses on the contribution to 
the cost of individual tasks. 

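Cost Index for Task $i$, $$ K_i(x) = \int_0^\infty \mathrm{Prob}\{\Gamma_i(t) \le x\}\, dL(t) \qquad (4a) $$

Mean Cost for task $i$, $$ M_i = \int_0^\infty E\{\Gamma_i(t) \mid \text{no hard deadlines are missed}\}\, dL(t) \qquad (6a) $$

Variance Cost for task $i$, $$ V_i = \int_0^\infty \mathrm{Var}\{\Gamma_i(t) \mid \text{no hard deadlines are missed}\}\, dL(t) \qquad (7a) $$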

The computation of these measures can sometimes be complicated by the fact that the 
mission might end while one or more versions are still extant. In most instances, how- 
ever, the mission lifetimes are very much longer than individual task execution times 5 
and the number of times tasks are executed to completion before the mission ends is also 
very large. For this reason, it is usually an acceptable approximation to compute the 
costs assuming that all jobs that enter the system during the mission complete executing 
before the mission ends (as long as they do not miss any hard deadlines). 

5 For example, it could be several hours for aircraft and several days or even months for spacecraft.

In what follows, we consider how to determine these performance measures, beginning
with the determination of the hard deadline.

3.3. Obtaining the Hard Deadline 

The dynamics and the nature of the operating environment of the critical process 
are both known a priori. This follows from the critical nature of the process — for exam- 
ple, the dynamics and operating environment of aircraft have both been studied care- 
fully — and advances in the theory and design of controlled processes. 

The process can most conveniently be expressed by a state-space 6 model. Let $\mathbf{x} \in R^n$
denote the process state, $\mathbf{u} \in R^m$ the input vector, and $t$ the time. The input vector is
made up of two sub-vectors, $\mathbf{u}_c \in R^{m_c}$ and $\mathbf{u}_e \in R^{m_e}$. $\mathbf{u}_c$ denotes the input delivered at the
command of the computer, and $\mathbf{u}_e$ the input generated and then applied by the operating
environment. We characterize state transitions by the mapping $\phi: T \times T \times X \times U \to X$,
where $T \subset R$ represents time, $X \subset R^n$ the state-space and $U \subset R^m$ the input space.

$$ \mathbf{x}(t) = \phi(t, t_0, \mathbf{x}(t_0), \mathbf{u}) \qquad (9a) $$

Measurement of the system is described by a vector $\mathbf{y} \in R^l$ and a mapping $g$ on $X \times U \times T$.

$$ \mathbf{y}(t) = g(\mathbf{x}(t), \mathbf{u}(t), t) \qquad (9b) $$

Catastrophic failure can follow if the process leaves the "safe" region of the state
space. For example, a boiler may explode if its temperature becomes too high. This is
formally expressed by defining an allowed state space, $X_a(t)$, which defines the "safe"
region of operation.


6 The term "state" here has the control-theoretic meaning, and is not the same as the one frequently
used in computer performance analysis.

The task of the controller or real-time computer is to derive the optimal control, $\mathbf{u}_c$,
as a function of the perceived process state. Since the response time, denoted by
$\omega$, is positive, we have $\mathbf{u}_c(t) = h(\mathbf{x}(t-\omega), \mathbf{u}_e(t-\omega), t)$, where $h$ expresses the control
algorithm for the task in question. Then, the hard deadline associated with this task is
given by the maximum value of $\omega$ that may be permitted if the process is to remain in
$X_a$ with probability one. More precisely, the hard deadline associated with controller
task $\alpha$ triggered at $t_0$ when the system is in state $\mathbf{x}(t_0)$ is given by:

$$ t_{d\alpha}(\mathbf{x}(t_0)) = \inf_{\mathbf{u}_e \in \Omega} \sup\{\tau \mid \phi(t_0+\tau, t_0, \mathbf{x}(t_0), \mathbf{u}) \in X_a\} \qquad (10) $$

where $\Omega$ is the admissible input space. One can also define conditional hard deadlines if
it is only required to perform the computation over a certain subset of the admissible
input or state space. The conditional hard deadline of task $\alpha$, denoted by $t_{d\alpha|\omega}$, is
defined as

$$ t_{d\alpha|\omega}(\mathbf{x}(t_0)) = \inf_{\mathbf{u}_e \in \omega} \sup\{\tau \mid \phi(t_0+\tau, t_0, \mathbf{x}(t_0), \mathbf{u}) \in \chi \subset X_a\} \qquad (11) $$

The hard deadline, defined in this general way, is a function both of the process 
state at the moment of task initiation, and of time. It is a random variable if the 
environment is stochastic. Consider the following example to see how the hard deadline 
can be determined. 

Example 1: A body of mass $m$ is constrained to move in one dimension. Its state-vector
consists of three components: position ($x_1$), velocity ($x_2$), and acceleration ($x_3$). The
allowed state-space is defined by $X_a = \{\mathbf{x} : |x_1| \le b, |x_2| < \infty, |x_3| < \infty\}$ where $b > 0$ is a
constant. The body is subject to impact from either direction with equal probability. Each
impact changes the velocity of the body by $k > 0$ units, in the appropriate direction. The
change of velocity takes place in a negligible duration. The body has devices that can
exert thrust of magnitude $H$ in either direction. This thrust is imposed only after the
controller has recognized an impact and has determined how to react to it. It takes a
negligible amount of time to switch the thrust on or off in either direction. The
controller’s job is to bring the body to $\mathbf{x} = 0$. The controller operates in open-loop; when
it recognizes an impact, it computes the thrusts as a function of time, following which
the control response is assumed to be instantaneous. The problem now is to compute
the hard deadline associated with this task.

The hard deadline is only a function of the state and X a . In computing it, we do 
not need to take into account the possibility of a second impact before the controller has 
finished responding to the first, since the state of the process contains all necessary infor- 
mation. 

The allowed state space is static and simply connected, so that if the body is in $X_a$
when it is brought to rest for the first time following the impact, it must have been
within $X_a$ throughout the period following the impact, assuming only that it was within
$X_a$ at the moment of impact. Therefore, we have only to compute the position of the
body when it first comes to rest (after the impact) as a function of the response time, $\xi$,
and set the hard deadline equal to the largest $\xi$ for which the body comes to rest within
$X_a$. Let the initial state of the body be $\mathbf{x}_i^T = [x_{1i}, x_{2i}, x_{3i}]$. Since the impact duration and
the switch-off or on time for the thrust are assumed to be zero, we can always take
$x_{3i} = 0$. Define

$$ t_1 = \frac{m}{H}\,|x_{2i} + k|, \qquad t_2 = \frac{m}{H}\,|x_{2i} - k| $$

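By an elementary derivation, we arrive at the following.

Case 1, $x_{2i} < -k$:

$$ t_d(\mathbf{x}_i) = \min\left\{ \frac{b + x_{1i} + (x_{2i}+k)\,t_1 + \frac{H t_1^2}{2m}}{-(x_{2i}+k)},\; \frac{b + x_{1i} + (x_{2i}-k)\,t_2 + \frac{H t_2^2}{2m}}{k - x_{2i}} \right\} \qquad (12) $$

For future convenience, denote the right hand side of the above by $t^{(1)}$.

Case 2, $|x_{2i}| < k$:

$$ t_d(\mathbf{x}_i) = \min\left\{ \frac{b - x_{1i} - (x_{2i}+k)\,t_1 + \frac{H t_1^2}{2m}}{x_{2i}+k},\; \frac{b + x_{1i} + (x_{2i}-k)\,t_2 + \frac{H t_2^2}{2m}}{k - x_{2i}} \right\} \qquad (13) $$

Denote the right hand side of the above equation by $t^{(2)}$.

Case 3, $x_{2i} > k$:

$$ t_d(\mathbf{x}_i) = \min\left\{ \frac{b - x_{1i} - (x_{2i}+k)\,t_1 + \frac{H t_1^2}{2m}}{x_{2i}+k},\; \frac{b - x_{1i} - (x_{2i}-k)\,t_2 + \frac{H t_2^2}{2m}}{x_{2i} - k} \right\} \qquad (14) $$

Denote the right hand side of the above equation by $t^{(3)}$.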
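
The three cases share one structure: for each of the two possible post-impact velocities, the body coasts for the response time and then brakes to rest, and the deadline is the smaller of the two resulting bounds. A minimal Python sketch of this reading follows. The function name hard_deadline and its argument names are hypothetical, and the treatment of a zero post-impact velocity (no bound from that direction) is our assumption, since the case boundaries are excluded in the text.

```python
import math


def hard_deadline(x1, x2, k, b, m, H):
    """Closed-form hard deadline for the one-dimensional body example.

    x1, x2: position and velocity at the moment of impact.
    k: velocity change per impact, b: half-width of the allowed region,
    m: mass, H: thrust magnitude.
    A negative result means that no response time is short enough.
    """
    bounds = []
    for v in (x2 + k, x2 - k):           # impact from either direction
        if v == 0.0:
            bounds.append(math.inf)      # assumption: body stays at rest
            continue
        t_stop = m * abs(v) / H          # braking time, t_1 or t_2
        margin = (b - x1) if v > 0 else (b + x1)
        numerator = margin - abs(v) * t_stop + H * t_stop ** 2 / (2 * m)
        bounds.append(numerator / abs(v))
    return min(bounds)
```

With x1 = 0, x2 = 0, k = 1, b = 10, m = 1 and H = 1, each direction gives (10 - 1 + 0.5) / 1, so the deadline is 9.5 time units.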

If the velocity imparted to the body upon an impact is not constant at $k$, but is a
random variable (which would be more realistic) the magnitude of which has probability
distribution function $F_{impact}$, then the hard deadline will be a random variable, whose
distribution is a function of the state at the moment of impact. The following can be
written down by inspection:

Case 1, $x_{2i} < 0$:

$$ t_d(\mathbf{x}_i) = \begin{cases} t^{(1)} & \text{with probability } F_{impact}(|x_{2i}|) \\ t^{(2)} & \text{with probability } 1 - F_{impact}(|x_{2i}|) \end{cases} \qquad (15) $$

Case 2, $x_{2i} > 0$:

$$ t_d(\mathbf{x}_i) = \begin{cases} t^{(2)} & \text{with probability } 1 - F_{impact}(|x_{2i}|) \\ t^{(3)} & \text{with probability } F_{impact}(|x_{2i}|) \end{cases} \qquad (16) $$

We could similarly treat the case when the allowed state-space is stochastic. 

The above example is meant only to illustrate the hard deadlines and should not 
lull the reader into a false sense of security. Obtaining closed-form expressions for the 
deadlines of any but the most trivial systems and static allowed space is usually 
extremely difficult, if not impossible. For example, if we relax the assumption that the 
controller acts in open-loop, the equations of motion become too difficult to solve exactly 
in closed form. 

It is generally necessary to resort to numerical methods to obtain deadlines for 
real-life systems. Since most of the state-spaces one uses in practice have uncountably 
many points, we must define hard deadlines as functions of sets of states, not of the 
states themselves if the entire allowed state-space is to be covered. Subdividing the 
state-space into these subsets while keeping errors low is not always easy. For an exam- 
ple of subdivision where the application is the control of aircraft elevator deflections, see 
the case study that follows in Section 4. 
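
To give a flavor of the numerical approach, the following Python sketch finds a deadline by simulation and bisection for the simple body of Example 1, where the answer can be checked against the closed form. It is a sketch under our own assumptions, not the procedure used in the case study: the function names, the step size, the search horizon, and the restriction to a single known post-impact velocity v are all hypothetical choices.

```python
import math


def stays_safe(xi, x1, v, b, m, H, dt=1e-3):
    """Simulate coasting for xi, then braking to rest; check |x1| <= b."""
    n = max(1, math.ceil(xi / dt))
    h = xi / n
    for _ in range(n):                   # coast at constant velocity
        x1 += v * h
        if abs(x1) > b:
            return False
    a = -math.copysign(H / m, v)         # thrust opposes the motion
    while v != 0.0:                      # brake until the body is at rest
        step = min(dt, abs(v / a))
        x1 += v * step + 0.5 * a * step * step
        v = 0.0 if step < dt else v + a * step
        if abs(x1) > b:
            return False
    return True


def numeric_deadline(x1, v, b, m, H, horizon=100.0, tol=1e-6):
    """Largest response time (up to tol) that keeps the body safe."""
    if not stays_safe(0.0, x1, v, b, m, H):
        return 0.0
    if stays_safe(horizon, x1, v, b, m, H):
        return math.inf                  # no deadline within the horizon
    lo, hi = 0.0, horizon
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if stays_safe(mid, x1, v, b, m, H):
            lo = mid
        else:
            hi = mid
    return lo
```

Bisection is valid here because safety is monotone in the response time: a longer delay never helps. For richer dynamics the simulation step would be replaced by a proper integrator, and the search repeated over each subset of the state-space.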

A further remark is in order here. The hard deadlines are not dependent upon the 
performance functional (time, energy, etc.) that the controller is attempting to optimize, 
since the paramount duty of the controller is to keep the system within the allowed 
state-space, and only secondarily to optimize the performance functional. 

3.4. Obtaining the Finite Cost Functions 

Performance functionals have been known for a long time in control theory as 
optimization criteria and measures of controlled process performance. We exploit this
fact to derive the control-computer cost functions, by linking directly the performance of
the controller to the value of the controlled process performance functional that results. 

Performance functionals in control theory are functionals of system state and input
and express the cost of running the process over some interval $[t_0, t_f]$. The performance
functionals can be stated as:

\[
\Theta(x_0, t_0, t_f) = \int_{t_0}^{t_f} E\left[ f_0(x(t), u(t), x_0, t) \mid y(\tau),\ t_0 \le \tau \le t \right] dt \tag{17}
\]

where $x(t_0) = x_0$, and $f_0$ is the instantaneous performance functional. Since the
controller response time affects the state trajectory of the controlled process, it affects the
performance functional as well. If we use the expected contribution to the performance
functional, $\Theta(x_0, t_0, t_f)$, of the control delivered as a result of executing a task with
response time $\xi$, we can derive the finite cost function as:

\[
\begin{cases}
\Omega(x,\xi) - \Omega(x,0) & \text{for } 0 \le \xi \le t_d \\
0 & \text{otherwise}
\end{cases}
\tag{18}
\]

where $t_d$ is the hard deadline, and $\Omega(x,\xi)$ denotes the contribution to $\Theta(x_0, t_0, t_f)$ of a
task with response time $\xi$, and initiated when the process state was x. By doing so, we can
directly couple the response time of the controller to the fuel, energy, or other commodity
expended by the controlled process. See Figure 3.
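
As an illustration of (18), the following minimal Python sketch evaluates the finite cost for a
given contribution function. The names `finite_cost` and `omega_example`, and the quadratic
form of `omega_example`, are hypothetical choices made only for this illustration; in practice
the contribution function would come from solving the controlled-process equations.

```python
from typing import Callable


def finite_cost(omega: Callable[[float, float], float],
                x: float, xi: float, hard_deadline: float) -> float:
    """Finite cost of a task started in state x with response time xi, as in (18)."""
    if 0.0 <= xi <= hard_deadline:
        # Extra contribution to the performance functional caused by the delay xi.
        return omega(x, xi) - omega(x, 0.0)
    return 0.0


def omega_example(x: float, xi: float) -> float:
    """Hypothetical contribution: grows quadratically with the response time."""
    return x * x * (1.0 + xi) ** 2


if __name__ == "__main__":
    print(finite_cost(omega_example, 2.0, 0.5, hard_deadline=1.0))
```

By construction the finite cost is zero at zero response time, so it measures only the overhead
attributable to controller delay.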

Notice that while we use response time to compute the cost function, the finite
costs were originally defined as functions of the extant time. The latter is the case since
we wish costs to accrue as the execution proceeds, so that the system cost function is
continuous. This ensures that if two systems are compared under an identical load, and the
first is faster than the second with regard to a particular task, then the faster system will
never exhibit a mean cost greater than that of the slower system, as long as both have
approximately the same probability of dynamic failure.


The possibility of correlation of successive tasks can complicate calculations consid- 
erably. To see this, take the system in the example above. If a second impact comes in 
before the system has finished reacting to the first, then, the energy or time to be 
expended will not, in general, be the sum of the energies or the times that would have to 
be expended if the second impact had arrived after the system had finished reacting to 
the first (i.e. had arrived at x=0). Assuming that successive tasks are decoupled leads
to a certain measure of "double-counting" of the energy or time spent. The same remark
would apply to fuel, force, or any other performance functional used for the controlled 
process. 

Due to this double-counting, assuming that successive tasks are decoupled leads to 
an upper bound to the energy, time, or other quantity expended. If we find an upper 
bound acceptable, we can simplify our computations greatly. If exact figures are called 
for, a detailed and complicated model has to be worked out in which each instance of 
inter-task coupling is itemized and its probability of occurrence computed. Whether or 
not this is worth the effort depends entirely on the requirements of the analysis. 

There is also an irritating anomaly. Since the mean costs are defined by an expecta- 
tion that is conditioned on not failing in the mission lifetime, it is possible to construct 
pathological examples where a system with a probability of dynamic failure of, say, 0.5 
over a given lifetime, will exhibit a lower mean cost over that lifetime than another that 
has a $p_{dyn}$ of $10^{-10}$; we shall see examples of them in Section 5. Such cases are, however,
generally no more than an academic curiosity. 
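
The anomaly can be reproduced with a small numerical sketch. The Python fragment below is a
deliberately simplified, single-task illustration (the text defines the mean cost over a whole
mission lifetime); the two response-time distributions, the linear cost, and the function name
`conditional_mean_cost` are all assumptions made for this example.

```python
def conditional_mean_cost(dist, deadline, g):
    """Return (failure probability, mean cost given that the deadline is met).

    dist is a list of (response_time, probability) pairs.
    """
    p_ok = sum(p for xi, p in dist if xi <= deadline)
    mean = sum(p * g(xi) for xi, p in dist if xi <= deadline) / p_ok
    return 1.0 - p_ok, mean


def cost(xi):
    """Hypothetical finite cost, proportional to the response time."""
    return xi


# System A misses the deadline half of the time but is very fast otherwise.
system_a = [(0.1, 0.5), (2.0, 0.5)]
# System B almost never misses the deadline but is slower when it succeeds.
system_b = [(0.9, 1.0 - 1e-10), (2.0, 1e-10)]

if __name__ == "__main__":
    print(conditional_mean_cost(system_a, 1.0, cost))
    print(conditional_mean_cost(system_b, 1.0, cost))
```

Because the expectation is conditioned on not failing, the unreliable system A shows the lower
mean cost here, which is why the mean cost should always be read together with the probability
of dynamic failure.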

Example 2: Consider again the controlled process described in Example 1. This time, 
we set out to compute the finite cost function associated with the task under review. 
As before, assume that the state of the body at the moment of impact is given by
$x_I = [x_{1I}, x_{2I}, x_{3I}]^T$. We make the assumption that a function that provides an upper bound
of the cost expended is sufficient, so that it is not necessary to consider the correlative 
effect of successive jobs. We provide cost functions relating to two different control poli- 
cies. 

Case A: Assume that the duty of the controller is to bring the body back to x=0 
within as short a time period as possible. (Note that x=0 means that all three com- 
ponents -- position, velocity, and acceleration -- are zero). The cost function is the time 
taken. This is the well-known minimum-time problem in optimal control theory [15]. If, 
after the impact, the body is moving away from $x_1 = 0$, it must be stopped, and brought
back using bang-bang control. If it is moving toward $x_1 = 0$, depending on the velocity
after impact and the response time of the controller, the body is either first accelerated
toward $x_1 = 0$ and then decelerated, or first brought to a stop on the other side of $x_1 = 0$
and then brought back to the origin using bang-bang control. The derivation of the
time taken is elementary, if tedious, and is excluded. See the Appendix for expressions 
of the finite cost function under such a control policy. The case when the velocity 
imparted upon impact is not constant, but a random variable, can be handled as in 
Example 1. 

Case B: Suppose the controller is to minimize the energy expended while, after every 
impact, keeping the body within the allowed state-space. Then, the control policy is 
simply to bring the system to rest anywhere inside $X_A$, and the cost function is in terms
of energy. If the controller computer responds to an impact within the hard deadline, it
can, by definition, keep the system from failing. As may easily be verified, the energy
expended in doing this is the energy required to bring the body to rest, which is equal to
the energy of the body immediately after impact. Therefore, as long as the allowed
state-space is not violated, the energy expended will remain the same no matter what
the response time (assuming that the response time is within the hard deadlines derived 
in the preceding section). So, the finite cost function over the entire allowed state-space 
is here the zero function, which signifies that, as long as the hard deadlines are honored, 
it makes no difference to the overhead under this control policy, as to what the response 
time may be. 

This example has served to emphasize the intimate relation between control policy 
and controller (finite) cost function. The same system, with the same constraints on the 
allowed state-space, has different cost functions based on what the duty of the controller 
is. It reflects our goal of having the cost functions express the control overhead in the 
context of the application. This, in fact, distinguishes our measures from those extant in
the literature. See Table 2 for a comparison between our measures and those of others. 

Once again, it should be noted that the simplicity of the above expository examples 
does not usually exist in real-life systems. Real-life analyses are much more difficult, and 
the same comments as applied to $p_{dyn}$ above apply to the mean cost, also. There is also
one additional complication. The controller is to optimize the performance functional
subject to the condition that the system must not leave the allowed state-space, if this is
at all possible. Such a difficulty did not arise in this simple example, but it can sometimes
prove difficult to obtain optimal control policies under this requirement. This,
however, is a problem for the designer of the controlled process, not of the controlling
computer.

See Section 4 for a computation of the cost function in a realistic case. 
Other Measures                            Our Measures
----------------------------------------  ------------------------------------------
Wide applicability to almost all          Limited applicability. Aims specifically
applications of fault-tolerant,           at real-time, especially control,
gracefully-degrading systems.             applications.

Measures express performance in           Measures express performance in
rather gross terms.                       rather exact terms.

Performance linked to characteristics     Performance measures specifically designed
of the computer alone.                    to reflect the overhead of the computer
                                          on the real-time system.

Table 2. Comparison of Traditional and New Methods of Characterizing Performance
3.5. Remark on Finite Cost Functions


It is not necessary that the process performance functional that is used to derive 
the cost function (e.g. energy, fuel, time, etc.) be the same as the process performance 
functional that the system is trying to optimize. For example, the system may be given 
the task of optimizing fuel, and the cost function may be measuring the extra energy 
consumed as a result of controller delay. However, care should be taken to ensure that
the two functionals (the one used to optimize the process, and the one used to express 
the cost with) do not conflict. For the functionals not to conflict, the optimal control 
actions (i.e. the controller decisions) taken on the basis of one functional should be ident- 
ical to the optimal control actions that would have been taken on the basis of the other. 
For example, if the fuel consumed were linearly related to the energy expended, the cost 
function could be expressed in terms of energy, while the controller was trying to minim- 
ize the fuel used. 

To see why conflicts must not be allowed, assume in the system of the above exam- 
ple, that the job of the controller is to minimize the time taken in bringing the body 
back to the origin, while the cost function is in terms of energy. Take the instance in
which the body, after the impact, is moving slowly toward $x_1 = 0$. The controller
should in such a case apply full thrust throughout the motion, first speeding the
body up toward $x_1 = 0$, and then slowing it down to reach x=0 in minimum time. However,
since the cost function is in terms of energy, not time, it is easy to see that the
shorter the time period over which the controller exerts thrust, the smaller is the value
of the cost measured. Thus, over a certain range of states, the cost function would actually
decrease with an increase in response time. This is not only counter-intuitive, but
also results in inefficient operation. Task priorities, scheduling policies, etc., for the control
computer are meant to be derived to optimize the mean cost as expressed by the
cost functions. If inconsistencies such as the above arose, the operating system of the 
controller would tend to oppose the goals inherent in its own applications software, 
resulting in an unsatisfactory overall computer-controlled system. 

3.6. Allowed State-Space and Its Decomposition 

As we said above, it is difficult to determine the hard deadline and the finite cost 
function as a function of the state over the entire state space. The solution of the con- 
trolled process state equations cannot usually be obtained in closed form when controller 
delay is considered. Obtaining the functional dependence of the hard deadlines or the
finite cost function of each controller job on the current state vector is therefore impossible
analytically, and prohibitively expensive numerically for a large number
of sample states.

To get around this problem, we divide the allowed state-space into subspaces.
Subspaces are aggregates of states in which the system exhibits roughly the same
behavior. 7 In each subspace, each critical controller job has a unique hard deadline.

Remark: In some subspaces, a job described in general as "critical" might not be critical
in the sense that even if the execution delay associated with it is infinity, catastrophic
failure does not occur. That is, the associated hard deadline may be infinity for
a particular subspace. What does usually happen in these circumstances is that the system
moves into a new subspace, or at the least toward the subspace boundary, in
which the dangers of catastrophic failure are greater. In this subspace, the requirements
on controller delay are more stringent, and there might well be a hard deadline,
representing a critical task. Thus a "critical" job need not be truly critical in every
subspace; it only has to map into a critical task (defined in the sequel) in at least one
subspace. Also, subspaces are job-related, i.e. the same allowed state space can divide
into a different set of subspaces for each control job.

7 Even if there do not exist clear boundaries for these subspaces, one can always force the allowed state
space to be divided into subspaces so that a sufficient safety margin can be provided. This is a designer's
choice for approximation.

For convenience, a controller "task" is defined as follows.

Definition: A controller task, often abbreviated to "task", is defined as a controller job
operating within a designated subspace of the allowed state space.

Let $S_i$ for $i = 0, 1, \ldots, s$ be disjoint subspaces of $X_A$ with $X_A = \bigcup_{i=0}^{s} S_i$, and let J
denote a controller job. Then, we need the projection
$(J, X_A) \rightarrow ((T_0, S_0), (T_1, S_1), \ldots, (T_s, S_s))$,
where $T_i$ is the controller task generated by executing J in $S_i$. With each controller task,
we may now define a hard deadline without the coupling problem mentioned above. We
denote it by $t^J_{d_i}$ for critical task $T_i$ (for convenience, however, the superscript J will be
omitted in the sequel). We will see that a critical job can possibly map into a non-critical
task for one or more allowed subspaces; it only needs to map into a critical task
in at least one such subspace to be considered critical.

3.6.1. Allowed State-Space

The allowed state-space is the set of states that the system must not leave if catastrophic
failure is not to occur. Consider the two sets of states $X_A^1$ and $X_A^2$ defined as follows.

(i) $X_A^1$ is the set of states that the system must reside in if catastrophic failure is not
to occur immediately. For example, we may define, in the case of an aircraft, a
situation in which the aircraft flies upside down as unacceptable to the passengers
and as constituting failure. Notice that terminal constraints are not taken into
consideration here unless the task in question is executed just prior to mission termination.

(ii) $X_A^2$ is the set of acceptable states given the terminal constraints, i.e., it is the set of
states from which, given the constraints on the control, it becomes possible to
satisfy the terminal constraints.

Note that leaving $X_A^1$ means that no matter how good our subsequent control, failure
has occurred. 8 On the other hand, altering the allowed input space, i.e. changing the
control available, can affect the set $X_A^2$. The allowed state space is then defined as

\[
X_A \equiv X_A^1 \cap X_A^2 .
\]

Obtaining the allowed state-space can be difficult in practice. The curse of dimensionality
ensures that even systems with four or five state variables make unacceptable demands
on computation resources for the accurate determination of the allowed state-space.
However, while it can be very difficult to obtain the entire allowed state-space, it is
somewhat easier to obtain a reasonably large subset, $X_A' \subset X_A$. By defining this subset as
the actual allowed state-space (i.e., by artificially restricting the range of allowed states),
we make a conservative estimate for the allowed state-space. Note that by making a
conservative approximation, we err on the side of safety. Also, the information we need
about $X_A'$ may be determined to as much precision as we are willing to invest in computing
resources.

8 Strictly speaking, of course, there can be no subsequent control since by leaving $X_A^1$ the system has
failed catastrophically before the next control could be implemented.

In what follows, to avoid needless pedantry, we shall refer to the artificially restricted
allowed state-space, $X_A'$, simply as the "allowed state-space", $X_A$.

3.6.2. On Obtaining the Subspaces 

The job of dividing $X_A$ into $S = (S_0, S_1, \ldots, S_s)$ is sometimes made easy by the
existence of natural cleavages in the state-space, when the latter is viewed as an influ- 
ence on system behavior. In most cases, however, such conveniences do not exist, and 
artificial means must be found. The problem then becomes one of finding discrete subdi- 
visions of a continuum. 

The method we employ is to quantize the state continuum in much the same way 
as analog signals are quantized into digital ones. Intervals of hard deadlines and 
expected operating cost (i.e. the mean of the cost function conditioned on the controller 
delay time, and using the distribution of the latter) are defined. Then, points are allocated
to subspaces corresponding to these intervals. To take a concrete example, consider
a state-space $X \subset R^n$ that is to be subdivided on the basis of the hard deadlines.
The first step is to define a quantization for the hard deadlines. Let this be $\Delta$. Then,
define subspace $S_i$ as containing all states in which the hard deadline lies in the interval
$[(i-1)\Delta, i\Delta)$. Alternatively, one might define a sequence of numbers $\Delta_1, \Delta_2, \ldots$, such
that the subspaces were defined by intervals with the $\Delta_i$ as their end-points. This
would correspond to quantizing with variable step sizes. The subspace in which the job
under consideration maps into a non-critical task is a special case and is denoted by $S_0$.

Subspaces can also be defined based on a quantization of the expected operating 
cost or on both the operating cost and the hard deadlines. We provide an example of 
subdivision by hard deadlines in Section 4. 
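
The quantization by hard deadline can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the function names are hypothetical, a non-critical
task is represented by an infinite hard deadline, and for variable step sizes the first
interval is assumed to start at zero.

```python
import math
from bisect import bisect_right
from typing import Sequence


def subspace_uniform(hard_deadline: float, delta: float) -> int:
    """Index i with hard_deadline in [(i-1)*delta, i*delta); 0 if non-critical."""
    if math.isinf(hard_deadline):
        return 0
    return int(hard_deadline // delta) + 1


def subspace_variable(hard_deadline: float, endpoints: Sequence[float]) -> int:
    """Variable step sizes: endpoints is the increasing sequence delta_1, delta_2, ...

    Subspace i is assumed to cover [delta_(i-1), delta_i), with delta_0 = 0.
    """
    if math.isinf(hard_deadline):
        return 0
    return bisect_right(endpoints, hard_deadline) + 1


if __name__ == "__main__":
    print(subspace_uniform(75.0, 60.0))            # deadline in [60, 120)
    print(subspace_variable(75.0, [60.0, 120.0]))  # same interval, explicit end-points
```

Each state would be assigned to a subspace by first computing (or bounding) its hard deadline
and then applying one of these maps.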
The size of each subspace will depend on the process state equations, the environment,
and how much computing effort it is judged to be worth spending on obtaining
the subspaces. Naturally, everything else being equal, the smaller a subspace the greater
the accuracy of the inherent approximation. 9

4. CASE STUDY 

A control system executes "missions." These are periods of operation between successive
periods of maintenance. In the case of aircraft, a mission is usually a single
flight. The operating interval can sometimes be divided into consecutive sections
that can be distinguished from each other. These sections are called phases. As pointed
out in Section 2, Meyer et al. [11] define four distinct phases in the mission lifetime of a
civilian aircraft. The phase to be considered here is landing; it takes about 20 seconds.
The controller job that we shall treat is the control of the aircraft elevator deflection
during landing. 10

The specific system employed is assumed to be organized as shown in Figure 4.
Sensors report on the four key parameters: altitude, descent rate, pitch angle, and pitch
angle rate every 60 milli-seconds. 11 We have a time-generated trigger, with a time period
of 60 milli-seconds. Every 60 milli-seconds, the controller computes the optimal setting
for the elevator, which is the only actuator used in the landing phase. 12 The execution
time for the computation is nominally 20 milli-seconds, although this can vary in
practice due to failures. Since the aircraft is a dynamical system, the effects of controller
delay are considerable, as we shall see in this Section.

Figure 4. Aircraft Control System Schematic.

9 The error that ensues as a result of quantization of the state space can be estimated in the same way
that quantization error is estimated in signal processing theory.

10 The output of the controller is assumed to be fed into a peripheral processor that is dedicated to
controlling the actuator, in this case the elevator.

11 The sensors and actuators are assumed to have their own dedicated processors for I/O purposes.
When we speak of "controller delay," we also include the delay in these processors. Also, the period of 60
milli-seconds is arbitrary, and the choice of this period does not alter the method developed here.

12 There are other actuators used aboard the aircraft for purposes of stability, horizontal speed control,
etc. We do not however consider them here, concentrating exclusively on the control of the elevator.

Since the process being controlled is critical (i.e. one in which some failures can lead to
catastrophic consequences), variations of controller delay and other abnormal behavior
by the controller must be explicitly considered. For simplicity, we do not allow job pipe- 
lining in the controller; in other words a controller job must be completed or abandoned 
before its successor can be initiated. The following controller abnormalities can occur: 

(i) The controller orders an incorrect output to the actuator. 

(ii) The controller takes substantially more than 20 milli-seconds (the nominal execu- 
tion time) but less than the inter-trigger interval of 60 milli-seconds to complete 
executing. 

(iii) The controller takes more than 60 milli-seconds to complete executing. In such a 
case, the abnormal job is abandoned and the new one initiated. We say that a 
control trigger is "missed" when this happens.

An analysis of controller performance during the landing phase must take each of the 
above abnormalities into account. 
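
The three abnormalities can be told apart mechanically from the execution time and the
correctness of the output. The sketch below is one possible encoding; the names `Outcome` and
`classify`, and in particular the 10 milli-second margin used to decide what counts as
"substantially more" than the nominal time, are assumptions of this illustration and do not
come from the text.

```python
from enum import Enum

NOMINAL_MS = 20.0   # nominal execution time
PERIOD_MS = 60.0    # inter-trigger interval


class Outcome(Enum):
    NORMAL = "normal"
    INCORRECT_OUTPUT = "incorrect output"   # abnormality (i)
    LATE = "late within the period"         # abnormality (ii)
    MISSED_TRIGGER = "missed trigger"       # abnormality (iii)


def classify(exec_ms: float, output_correct: bool,
             late_margin_ms: float = 10.0) -> Outcome:
    """Classify one controller job; late_margin_ms is a hypothetical threshold."""
    if exec_ms > PERIOD_MS:
        # The job is abandoned and its successor initiated.
        return Outcome.MISSED_TRIGGER
    if not output_correct:
        return Outcome.INCORRECT_OUTPUT
    if exec_ms > NOMINAL_MS + late_margin_ms:
        return Outcome.LATE
    return Outcome.NORMAL


if __name__ == "__main__":
    for sample in [(20.0, True), (45.0, True), (75.0, True), (20.0, False)]:
        print(sample, classify(*sample).value)
```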

4.1. The Controlled Process 

The model and the optimal control solution used are due to Ellert and Merriam [16].

The aircraft dynamics are characterized by the equations:

\begin{align}
\dot{x}_1(t) &= b_{11} x_1(t) + b_{12} x_2(t) + b_{13} x_3(t) + c_{11} m_1(t, \xi) \tag{19a} \\
\dot{x}_2(t) &= x_1(t) \tag{19b} \\
\dot{x}_3(t) &= b_{32} x_2(t) + b_{33} x_3(t) \tag{19c} \\
\dot{x}_4(t) &= x_3(t) \tag{19d}
\end{align}

where $x_2$ is the pitch angle, $x_1$ the pitch angle rate, $x_3$ the altitude rate, and $x_4$ the
altitude. $m_1$ denotes the elevator deflection, which is the sole control employed. The
constants $b_{ij}$ and $c_{11}$ are given in Table 3. Recall that $\xi$ denotes controller response time.
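
For readers who wish to experiment with the model, the state equations (19a) to (19d) can be
coded directly. The following is a minimal sketch using the constants of Table 3; the function
names are hypothetical, a plain forward-Euler step is used only for brevity, and angles are
assumed to be expressed in radians.

```python
# Constants from Table 3.
B11, B12, B13 = -0.600, -0.760, 0.003
B32, B33 = 102.4, -0.4
C11 = -2.374


def state_derivative(x, m1):
    """Right-hand side of (19a)-(19d).

    x  = (x1, x2, x3, x4): pitch angle rate, pitch angle, altitude rate, altitude.
    m1 = elevator deflection in effect at this instant.
    """
    x1, x2, x3, x4 = x
    return (
        B11 * x1 + B12 * x2 + B13 * x3 + C11 * m1,  # (19a)
        x1,                                          # (19b)
        B32 * x2 + B33 * x3,                         # (19c)
        x3,                                          # (19d)
    )


def euler_step(x, m1, dt):
    """Advance the state by one forward-Euler step of length dt seconds."""
    dx = state_derivative(x, m1)
    return tuple(xi + dt * dxi for xi, dxi in zip(x, dx))


if __name__ == "__main__":
    x = (0.0, 0.1, -20.0, 100.0)
    print(euler_step(x, m1=0.0, dt=0.06))
```

The controller delay enters only through `m1`, which at time t is computed from sensor data
that is older than t by the response time.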

The phase of landing takes about 20 seconds. Initially, the aircraft is at an altitude 
of 100 feet, travelling at a horizontal speed of 256 feet/sec. This latter velocity is 
assumed to be held constant over the entire landing interval. The rate of ascent at the 
beginning of this phase is -20 feet/sec. The pitch angle is ideally to be held constant at 
2°. Also, the motion of the elevator is restricted by mechanical stops. It is constrained
to be between -35° and 15°. For linear operation, the elevator may not operate against
the elevator stops for nonzero periods of time during this phase. Saturation effects are 
not considered. Also not considered are wind gusts and other random environmental 
effects. 


The constraints are as follows: The pitch angle must lie between 0° and 10° to
avoid landing on the nose-wheel or on the tail, and the angle of attack (see Figure 5)
must be held to less than 18° to avoid stalling. The vertical speed with which the aircraft
touches down must be less than around 2 feet/sec so that the undercarriage can
withstand the force of landing. 

The desired altitude trajectory is given by

\[
x_{4d}(t) =
\begin{cases}
100\, e^{-t/5} & 0 \le t \le 15 \\
20 - t & 15 \le t \le 20
\end{cases}
\tag{20}
\]

while the desired rate of ascent is

\[
x_{3d}(t) =
\begin{cases}
-20\, e^{-t/5} & 0 \le t \le 15 \\
-1 & 15 \le t \le 20
\end{cases}
\tag{21}
\]

The desired pitch angle is 2° and the desired pitch angle rate is 0° per sec.

Feedback Term    Value
-------------    ------
b_11             -0.600
b_12             -0.760
b_13              0.003
b_32              102.4
b_33             -0.4
c_11             -2.374

Table 3. Feedback Equation Constants

Figure 5. Definition of Aircraft Angles.
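
The desired trajectories of equations (20) and (21) are simple enough to check numerically.
The sketch below is illustrative only; the function names are hypothetical, and the branch
taken at exactly t = 15 seconds is an arbitrary choice, since the two branches of the altitude
profile differ there by only about 0.02 feet.

```python
import math


def desired_altitude(t: float) -> float:
    """Desired altitude in feet, equation (20), for 0 <= t <= 20 seconds."""
    return 100.0 * math.exp(-t / 5.0) if t <= 15.0 else 20.0 - t


def desired_ascent_rate(t: float) -> float:
    """Desired rate of ascent in feet/sec, equation (21)."""
    return -20.0 * math.exp(-t / 5.0) if t <= 15.0 else -1.0


if __name__ == "__main__":
    for t in (0.0, 5.0, 15.0, 20.0):
        print(t, desired_altitude(t), desired_ascent_rate(t))
```

Differentiating the altitude profile reproduces the rate profile on each branch, so (20) and
(21) are mutually consistent: exponential flare for the first 15 seconds, followed by a
constant sink rate of 1 foot/sec down to zero altitude at 20 seconds.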

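The performance index (for the aircraft) chosen by Ellert and Merriam and suitably
adapted here to take account of the nonzero controller response time $\xi$ is given by

\[
\Theta(\xi) = \int_{t_0}^{t_f} e_m(t, \xi)\, dt \tag{22}
\]

where t represents time, $[t_0, t_f]$ is the interval under consideration, and where

\[
e_m(t,\xi) = \tilde{\phi}_4(t)[x_{4d}(t) - x_4(t)]^2 + \tilde{\phi}_3(t)[x_{3d}(t) - x_3(t)]^2
 + \tilde{\phi}_2(t)[x_{2d}(t) - x_2(t)]^2 + \tilde{\phi}_1(t)[x_{1d}(t) - x_1(t)]^2 + [m_1(t,\xi)]^2
\]

where the d-subscripts denote the desired (i.e. ideal) trajectory. To ensure that the
touch-down conditions are met, the weights must be impulse weighted. Thus we
define:

\begin{align}
\tilde{\phi}_4(t) &= \phi_4 + \phi_{4,t_f}\, \delta(20 - t) \tag{23a} \\
\tilde{\phi}_3(t) &= \phi_3(t) + \phi_{3,t_f}\, \delta(20 - t) \tag{23b} \\
\tilde{\phi}_2(t) &= \phi_2(t) + \phi_{2,t_f}\, \delta(20 - t) \tag{23c} \\
\tilde{\phi}_1(t) &= 0 \tag{23d}
\end{align}

where the functions $\phi$ must be given suitable values, and $\delta$ denotes the Dirac-delta
function. The values of the $\phi$ are given based on a study of the trajectory that results. The
chosen values are listed in Table 4.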

The control law for the elevator deflection is given by:

\[
m_1(t,\xi) = \omega_s^2 K_s T_s \left[ k_1(t-\xi) - k_{11}(t-\xi) x_1(t-\xi) - k_{12}(t-\xi) x_2(t-\xi)
 - k_{13}(t-\xi) x_3(t-\xi) - k_{14}(t-\xi) x_4(t-\xi) \right] \tag{24}
\]

where the aircraft parameters are given by: $K_s = -0.95$ sec$^{-1}$, $T_s = 2.5$ sec,
$\omega_s = 1$ radian sec$^{-1}$, and the constants k are the feedback parameters derived (as shown
in [16]) by solving the Riccati differential equations that result upon minimizing the
process performance index. For these differential equations we refer the reader to [16].

Weighting Factor                   Value
-------------------------------    -------
\phi_2(t)                          99.0
\phi_{2,t_f}                       20.0
\phi_3(t)   (0 <= t < 15)          0.0
\phi_3(t)   (15 <= t <= 20)        0.0001
\phi_{3,t_f}                       1.000
\phi_4                             0.00005
\phi_{4,t_f}                       0.001

Table 4. Weights for the Performance Index

4.2. Derivation of Performance Measures 

We consider here only one controller task: that of computing the elevator deflection 
so as to follow the desired landing trajectory. The inputs for the controller here are the 
sensed values of the four states. 

We seek the following information. As the controller delay increases, how much 
extra overhead is added to the performance index? Also, it is intuitively obvious that 
too great a delay will lead to a violation of the terminal (landing) conditions, thus result- 
ing in a plane crash. This corresponds to dynamic failure, and we are naturally 
interested in determining the range of controller delays that permit a safe landing. 

Consider first a formal treatment of the problem. The control problem is of the 
linear feedback form. The state equations can be expressed as: 

\[
\dot{x}(t) = A x(t) + B u(t) \tag{25}
\]

where the symbols have their traditional meanings. Define the feedback matrix by $\Xi(t)$.
Then, clearly,

\[
u(t) = \Xi(t-\xi)\, x(t-\xi) \tag{26}
\]

For a small controller delay (i.e., a small $\xi$), the above can be expanded in a Taylor
series and the terms of second order and higher discarded for a linear approximation. By
carrying out the obvious mathematical steps, we arrive at the equation:

\[
\dot{x}(t) = E(t,\xi)\, x(t) + f(t) \tag{27}
\]
as representing the behavior of the controlled process, assuming that the initial conditions
are given. For further details, see Figure 6.
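
The first-order expansion referred to above can be written out as follows. This is a generic
sketch, not the report's own derivation: it writes the feedback matrix as $\Xi(t)$, treats the
pure feedback law $u(t) = \Xi(t-\xi)x(t-\xi)$ only, and assumes that $I + \xi B \Xi(t)$ is
invertible. A forcing term such as the one in (27) would presumably arise from a feedforward
component of the control law, which is omitted here.

```latex
\begin{align*}
u(t) = \Xi(t-\xi)\,x(t-\xi)
  &\approx \bigl[\Xi(t) - \xi\,\dot{\Xi}(t)\bigr]\bigl[x(t) - \xi\,\dot{x}(t)\bigr] \\
  &\approx \Xi(t)\,x(t) - \xi\bigl[\dot{\Xi}(t)\,x(t) + \Xi(t)\,\dot{x}(t)\bigr]
  && \text{(terms in } \xi^2 \text{ discarded)} \\[4pt]
\dot{x}(t) = A\,x(t) + B\,u(t)
  &\;\Longrightarrow\;
  \bigl[I + \xi B\,\Xi(t)\bigr]\dot{x}(t)
  \approx \bigl[A + B\,\Xi(t) - \xi B\,\dot{\Xi}(t)\bigr]x(t) \\[4pt]
  &\;\Longrightarrow\;
  \dot{x}(t) \approx \bigl[I + \xi B\,\Xi(t)\bigr]^{-1}
  \bigl[A + B\,\Xi(t) - \xi B\,\dot{\Xi}(t)\bigr]x(t)
\end{align*}
```

Under these assumptions the delay appears both in a leading inverse factor and in a correction
proportional to $\xi$, which matches the general shape of the first-row entries shown in Figure 6.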

Given a closed-form expression for the $k_{ij}(t)$ that appear in $E(t,\xi)$, we could then
proceed to study the characteristics of the system as a function of the matrix E. However,
in the absence of such closed formulations for the $k_{ij}$, we must take recourse to the
less elegant medium of numerical solution.

The procedure we follow for obtaining the numerical solution is as follows. First,
the feedback values are computed by solving the feedback differential equations that
define the $k_{ij}$. These are not affected by the magnitude of the controller delay. Then, the
state equations are solved as simultaneous differential equations. The solutions are used to
check that the terminal constraints have been satisfied and, in the event that they are, the
performance functional is evaluated. This procedure must be repeated for each new subspace.
Since the environment is deterministic in this case (no wind gusts or other random
disturbances are permitted in the model), the hard deadline associated with each process
subspace is a constant and not a random variable.
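
The shape of this numerical procedure can be illustrated on a toy problem. The Python sketch
below does not use the aircraft model or the feedback gains of [16]; it assumes a hypothetical
scalar plant x' = a x + u with delayed feedback u(t) = -k x(t - delay), an allowed state-space
|x| <= bound, and a quadratic performance functional. For each delay it integrates the state
equation, checks the constraint, and evaluates the functional; the hard deadline is then
estimated as the largest survivable delay on a grid.

```python
def simulate(delay, a=1.0, k=1.0 + 2.0 ** 0.5, x0=1.0, bound=5.0,
             horizon=20.0, dt=0.001):
    """Forward-Euler integration of x' = a*x + u with u(t) = -k*x(t - delay).

    Returns (ok, cost). ok is False as soon as |x| exceeds bound; cost is
    the integral of x^2 + u^2 over the horizon.
    """
    lag = round(delay / dt)
    steps = round(horizon / dt)
    xs = [x0]
    cost = 0.0
    for n in range(steps):
        x = xs[-1]
        # State the controller acted upon; before t = delay it only knows x0.
        x_seen = xs[n - lag] if n >= lag else x0
        u = -k * x_seen
        cost += (x * x + u * u) * dt
        x_next = x + dt * (a * x + u)
        if abs(x_next) > bound:
            return False, float("inf")
        xs.append(x_next)
    return True, cost


def finite_cost(delay):
    """Extra cost relative to zero delay, or None if the constraint is violated."""
    ok, cost = simulate(delay)
    _, baseline = simulate(0.0)
    return cost - baseline if ok else None


def hard_deadline(delays):
    """Largest grid delay such that it and every smaller grid delay is survivable."""
    best = None
    for d in sorted(delays):
        ok, _ = simulate(d)
        if not ok:
            break
        best = d
    return best


if __name__ == "__main__":
    grid = [0.1 * i for i in range(11)]
    print("finite cost at delay 0.3:", finite_cost(0.3))
    print("estimated hard deadline:", hard_deadline(grid))
```

The grid spacing bounds the accuracy of the estimated deadline, in the same way that the
quantization step bounds the accuracy of the subspace decomposition.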

The trajectory followed by the aircraft when the delay is less than about 60 milli- 
seconds follows the optimal trajectory closely although the elevator deflections required 
would be intuitively assumed to increase as the delay increases. Also, the susceptibility 
of the process to failure in the presence of incorrect or no input is expected to rise with 
the introduction of random environmental effects. 

The control that is required for various values of controller delay is shown in Figure 
7. Due to the absence of any random effects, elevator deflections for all the delays con- 
sidered tend to the same value as the end of the landing phase (20 seconds) is 
approached, although much larger controls are needed initially. In the presence of ran- 





\[
E(t,\xi) =
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
1 & 0 & 0 & 0 \\
0 & b_{32} & b_{33} & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
\]

where 

a n — [l-CiiArn( 0^" 1 [&h - *ii(O c ii -c ii£{^0+2&u*ii(0+*12(0- c ii*ii(0}] 

a 12 = [l -c ll^ll(0£] _1 l&12 -c 11^12(0 -c ll£{^11^12(0+^12^1l(0+^22(0~ c ll^ll(0}] 
<*13 = [l — c ii^n( 0£l 1 I^13 -c 11^13(0+^13^1l(0+^23(0 -c ll^ll( 0 ^ 13 ( 0 }] 

«„ = [l-cfi*ii( »)<]-' 

$b_{32}$ = Horizontal Velocity / $T_s$

$b_{33} = -1/T_s$

When the execution delay is £, the approximate state equations are 

i(<) = E(I,{)*(I) + 0 

0 

Figure 6. The Approximate State Equations. 
dom effects, the divergence between the controls needed for low and for high values
of controller delay is even more marked. We present an example of this in Figure 8. The
random effect considered here is the elevator being stuck at -35° for 60 milli-seconds, 8
seconds into the landing phase, due to a faulty controller order. The controlled process is
assumed in Figure 8 to be in the subspace in which the landing job maps into a non-critical
process (defined in the sequel as $S_0$). The diagrams speak for themselves. We
shall show later that this demand on control is fully represented by the nature of the 
derived cost function. Also, above a certain threshold value for controller delay, we 
would expect the system to become unstable. This is indeed the case in the present prob- 
lem, although this point occurs beyond a delay of 60 milli-seconds for all points in the 
allowed state space (obtained in the next section), which cannot by definition occur 
here. 

4.2.1. Allowed State Space 

In this subsection, we derive the allowed state space of the aircraft system. To do 
so, note that in Ellert and Merriam's model, $X_A^1$ does not exist. The reason is that the
state equations do not take into account the angle of attack. In the idealized model we 
are considering, it is implicitly assumed that the constraint on the angle of attack is 
always honored, so that the only constraints to be considered are the terminal con- 
straints. 

The terminal constraints have been given earlier but are repeated here for conveni- 
ence. The touchdown speed must be less than 2 feet/sec in the vertical direction, and 
the pitch angle at touchdown must lie between 0° and 10°. To avoid overshooting the
runway, touchdown must occur at between 4864 and 5120 feet in the horizontal direction
from the moment the landing phase begins. The horizontal velocity is assumed to be
kept constant throughout the landing phase at 256 feet/sec. 13 Thus, touchdown should
occur between 19 and 20 seconds after the descent phase begins. 14 The only control is
the elevator deflection, which must be kept between -35° and 15°.
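
The touchdown window follows from simple arithmetic on the figures quoted above, and the
terminal constraints can be collected into a single check. The sketch below is illustrative;
the function names are hypothetical, and treating the pitch limits as inclusive is an
assumption.

```python
HORIZONTAL_SPEED = 256.0           # feet/sec, held constant during landing
RUNWAY_WINDOW = (4864.0, 5120.0)   # feet from the start of the landing phase


def touchdown_window():
    """Earliest and latest admissible touchdown times in seconds."""
    near, far = RUNWAY_WINDOW
    return near / HORIZONTAL_SPEED, far / HORIZONTAL_SPEED


def terminal_constraints_met(t_touchdown, vertical_speed, pitch_deg):
    """Check touchdown time (s), vertical speed (feet/sec) and pitch angle (degrees)."""
    t_lo, t_hi = touchdown_window()
    return (t_lo <= t_touchdown <= t_hi
            and abs(vertical_speed) < 2.0
            and 0.0 <= pitch_deg <= 10.0)


if __name__ == "__main__":
    print(touchdown_window())
    print(terminal_constraints_met(19.5, -1.0, 2.0))
```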

Since the only constraints employed are terminal, the allowed state-space is exactly 
the set of states from which the terminal constraints can be satisfied. $X_A$ is therefore a
reachability set in control-theoretic terms. However, finding the entire allowed state- 
space can be computationally expensive, so we follow a cheaper alternative. The initial 
conditions of the process as it enters the landing stage are known. Also known is that 
the controller is triggered every 60 milli-seconds. It is assumed that the computations 
take a minimum of 20 milli-seconds to complete. Using these data, it becomes possible 
to determine that portion of the allowed state-space that the controlled process is ever 
likely to enter to a good approximation. In Figure 9, we plot the range of allowed state 
values that we obtain. As indeed it should be, the allowed state-space is a function of 
time. 


4.2.2. Designation of Subspaces 

We subdivide the allowed state-space found above using the method described in 
Section 3. The criterion used is the hard deadline, since the finite cost function (derived 
in the next subsection) is found not to vary greatly within the whole of the allowed 
state-space. The value of $\Delta$ chosen is 60 milli-seconds. In other words, we wish to consider
only the case where a trigger is "missed."

13 We do not consider here how that is to be done; in practice this will constitute a second controller 
job. We do not treat this here. 

14 This makes time an "implicit” state variable. 
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Figure 9(b). Allowed State Space: Descent Rate. 


Figure 9(c). Allowed State Space: Pitch Angle. 

Figure 9(d). Allowed State Space: Pitch Angle Rate. [Horizontal axis: time (sec), 0 to 20.] 
The allowed state-space in Figure 9 is subdivided into two subspaces, S_0 and S_1. 
These correspond to the deadline intervals [120, ∞) and [60, 120) respectively. S_0 is the 
non-critical region corresponding to the [120, ∞) interval. Here, even if the controller 
exhibits any of the abnormalities considered earlier, the airplane will not crash. In other 
words, if the controller orders an incorrect output, exhibits an abnormal execution delay 
or simply provides no output at all before the following trigger, the process will still 
survive at the end of the current inter-trigger interval if, at the beginning of that interval, 
it was in S_0. 

On the other hand, if the process is in S_1 at the beginning of an inter-trigger 
interval, it may safely endure a delay in controller response. However, if the controller 
behaves abnormally in either providing no output at all for the current trigger cycle or 
in ordering an incorrect output, there is a positive probability of an air crash. 

Notice that we explicitly consider only missing a single trigger, not the case when 
two or more triggers might be missed in sequence. This is because dynamic failure is 
treated here as a function of the state at the moment of triggering. If two successive 
triggers are missed, for example, we have to consider two distinct states, namely the 
states the process is in at the moment of those respective triggers. To speak of deadline 
intervals beyond 120 milli-seconds is therefore meaningless in this case since the triggers 
occur once every 60 milli-seconds. This is why the second deadline interval considered is 
[120, ∞), not [120, 180). 

The hard deadline may conservatively be assumed to be 60 milli-seconds in S_1. By 
definition it is infinity in S_0. 
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4.2.3. Finite Cost Functions 


The finite cost does not vary greatly within the entire allowed state-space. It is 
therefore sufficient to find a single cost function for both S_0 and S_1. 

The determination of the cost function is carried out as a direct application of its 
definition. That is, the process differential equations are solved with varying values of ξ. 
The value of ξ cannot be greater than the inter-trigger interval of 60 milli-seconds since, 
by assumption, no job pipelining is allowed and the controller terminates any execution 
in progress upon receiving a trigger. The finite cost function is found by computation to 
be approximately the same over the entire allowed state-space as defined in Figure 9. 

In Figure 10, the finite cost function is plotted. The cost function is in the units of 
the performance index. Bear in mind that these measures are the result of an idealized 
model. We have, for example, ignored the effects of wind gusts and other random 
effects of the environment. When these are taken into account, the demands on con- 
troller speed get even greater, i.e. the costs increase. 

The reader should compare the nature of the cost function with the plots showing 
elevator deflection in Figure 7, and notice the correlation between the marginal increase 
in cost with increased execution delay and the marginal increase in control needed, also 
as a function of the execution delay. 

5. APPLICATIONS OF THE MEASURES 

5.1. Introduction 

In this Section, we consider two applications of the performance measures that have 
been discussed thus far. We begin with the tradeoff between reliability and throughput 
that is at the heart of distributed computing. We show how the use of our measures 
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makes the resolution of this tradeoff sensitive to the application. 

The second application that we present is to synchronization and fault-masking in 
redundant real-time systems. First, we consider synchronization, both in hardware and 
software. In doing so, we present a theorem that makes possible the design of an 
arbitrarily large phase-locked, fault-tolerant clock. We show that software synchronization 
techniques are excessively time-consuming, and indeed impose a limit on the size of a 
cluster that can be thus synchronized. Next, the fault-masking techniques of voting and 
interactive consistency (Byzantine Generals) algorithms are considered, and their delay 
overheads estimated. Finally, we use these results to compare the reliabilities of 
reconfigurable and non-reconfigurable systems, operating in real-time, under the constraints of 
a hard deadline. 

5.2. The Number-Power Tradeoff 

The number-power tradeoff problem can be stated as follows [17]. It appears 
intuitively obvious that, if there are no device failures, a system with a single processor with 
exponentially distributed service time with mean 1/μ is more efficient than an 
N-processor (N > 1) system with each processor providing exponential service at rate μ/N. 
When failure is allowed for in the model, the above assertion is no longer obvious, and 
may not even be true in specific instances. So, we ask the question, "Given that the 
total processing power (number of processors × service rate per processor, called here 
the number-power product) is fixed at μ, what is the optimal number, N, of processors 
that the system should start out with for a specific mission lifetime?" We extend an 
adaptation of this problem to demonstrate the use of our performance measures to 
real-time computers. 
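
The failure-free half of this argument can be checked numerically. The sketch below is an illustration, not part of the original analysis: it uses the standard Erlang C formula for the M/M/c queue to compute the mean response time when a fixed number-power product is split evenly over N processors. The function names are hypothetical, and the values 100 and 2200 are simply the arrival rate and number-power product used later in Section 5.2.2.

```python
from math import factorial


def erlang_c(c, lam, mu):
    """Probability that an arriving job must queue in an M/M/c system."""
    a = lam / mu  # offered load in units of one processor
    if a >= c:
        raise ValueError("unstable queue: need c * mu > lam")
    tail = a**c / factorial(c) * c / (c - a)
    head = sum(a**k / factorial(k) for k in range(c))
    return tail / (head + tail)


def mean_response_time(n_proc, lam, total_power):
    """Mean response time when total_power is shared equally by n_proc processors."""
    mu = total_power / n_proc  # service rate of each individual processor
    mean_wait = erlang_c(n_proc, lam, mu) / (n_proc * mu - lam)
    return mean_wait + 1.0 / mu


if __name__ == "__main__":
    lam, power = 100.0, 2200.0  # jobs per hour, number-power product
    for n in (1, 2, 4, 8):
        print(n, mean_response_time(n, lam, power))
```

With no failures in the model, the mean response time grows with the number of processors, which is the intuition the text appeals to before reliability is taken into account.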
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Two observations are in order here. Firstly, we tacitly assume that processors with 
any prescribed power are available. This is not true, although a wide variety of proces- 
sors is available. Secondly, such a tradeoff depends for its resolution upon the cost func- 
tions and hard deadlines introduced above. It is this second point we pursue here. 
Specifically, we set out to determine for an example control computer, the configurations 
that meet specifications of reliability, and the sensitivity of reliability and the mean cost 
to changes in the number-power product under different operating conditions of hard 
deadline and mission lifetime. Implicit in all this will be the tradeoff between device 
redundancy and device speed. 


5.2.1. System Description and Analysis 


We use the multiprocessor system in Figure 11 to demonstrate the idea. Assume 
that there is a single job class that enters the system as a Poisson process with rate λ, 
and which requires an exponentially distributed amount of service with mean 1/μ. Then 
the system at any given time is an M/M/c queue (if the small dispatch time is ignored), 
where c is the number of processors functioning at that time. The distribution function 
for the response time for an M/M/c queue is well known [18]. It is given by: 

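$$F_{MMc}(t) = \frac{\{\lambda - c\mu + \mu W_c(0)\}(1 - e^{-\mu t}) + \mu\{1 - W_c(0)\}\{1 - e^{-(c\mu - \lambda)t}\}}{\lambda - (c-1)\mu} \tag{28}$$

where W_c(0) is the probability that, when there are c processors functional, at least one 
functional processor is free. This is given by: 

$$W_c(0) = 1 - \frac{(\lambda/\mu)^c}{c!}\,\frac{c\mu}{c\mu - \lambda}\left[\sum_{k=0}^{c-1}\frac{(\lambda/\mu)^k}{k!} + \frac{(\lambda/\mu)^c}{c!}\,\frac{c\mu}{c\mu - \lambda}\right]^{-1} \tag{29}$$

This distribution function is not defined unless cμ > λ. 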
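
As a rough illustration of Eqs. (28) and (29), the following sketch evaluates the response-time distribution using the standard M/M/c results. The function names are hypothetical, and the guard against λ = (c-1)μ is an added assumption: the closed form has a removable singularity there, which the sketch simply rejects rather than handling by a limit.

```python
from math import exp, factorial


def prob_no_wait(c, lam, mu):
    """Probability that at least one of c processors is free (Eq. 29)."""
    a = lam / mu
    tail = a**c / factorial(c) * (c * mu) / (c * mu - lam)
    head = sum(a**k / factorial(k) for k in range(c))
    return 1.0 - tail / (head + tail)


def response_time_cdf(t, c, lam, mu):
    """Distribution function of the M/M/c response time (Eq. 28)."""
    if c * mu <= lam:
        raise ValueError("not defined unless c * mu > lam")
    denom = lam - (c - 1) * mu
    if denom == 0:
        raise ValueError("closed form is indeterminate when lam == (c - 1) * mu")
    w0 = prob_no_wait(c, lam, mu)
    num = (lam - c * mu + mu * w0) * (1.0 - exp(-mu * t))
    num += mu * (1.0 - w0) * (1.0 - exp(-(c * mu - lam) * t))
    return num / denom


if __name__ == "__main__":
    # Probability of meeting a 0.01 hour deadline with 4 processors of rate 550 each.
    print(response_time_cdf(0.01, 4, 100.0, 550.0))
```

For c = 1 the expression collapses to the familiar M/M/1 result, 1 - exp(-(μ - λ)t), which gives a convenient sanity check.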
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Figure 11. A Real-Time Multiprocessor. 
Assume that the hard deadline has a probability distribution function F_d, and that 
the finite cost function for the task is denoted by g (as in Eq. (18)). Let the processors 
fail according to an exponential law with rate μ_p. Also, let n be the smallest integer for 
which nμ > λ; clearly, if there are fewer than n processors functioning, the utilization 
exceeds unity, and failure takes place with certainty. 

The system fails if a hard deadline is violated, or if there are fewer than n 
processors functioning. 15 When the system is in a state i ≥ n, the rate at which failure can 
happen is equal to the product of the task input rate and the probability that the response 
time of the system exceeds the hard deadline. If we assume that steady state is achieved 
between each state transition (i.e. between processor failures), then the probability 
distribution function of the response time at state i is always given by F_MMi. Such an 
assumption is valid, since the Mean Time Between Failures for components used in such 
systems typically ranges from 1,000 to 10,000 hours. The probability of dynamic failure 
can therefore be computed using the Markov model in Figure 12. Denoting the 
probability of being in state i by π_i, the quantity λ[1 - F_MMi(t)] by α(i,t), the failure state by 
fail, and the number of processors at start-up by N, the following balance equations can 
be written: 

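$$\dot{\pi}_N(t) = -\Big[N\mu_p + \int_0^\infty \alpha(N,\xi)\,dF_d(\xi)\Big]\pi_N(t), \qquad \pi_N(0) = 1 \tag{30a}$$

$$\dot{\pi}_i(t) = -\Big[i\mu_p + \int_0^\infty \alpha(i,\xi)\,dF_d(\xi)\Big]\pi_i(t) + (i+1)\mu_p\,\pi_{i+1}(t), \qquad \pi_i(0) = 0, \quad n \le i < N \tag{30b}$$

$$\dot{\pi}_{fail}(t) = \sum_{i=n}^{N}\Big\{\int_0^\infty \alpha(i,\xi)\,dF_d(\xi)\Big\}\pi_i(t) + n\mu_p\,\pi_n(t), \qquad \pi_{fail}(0) = 0 \tag{30c}$$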


15 We do not consider here the failure of the interconnection net, or of the dispatcher. Taking account 
of these is easy, but would obscure the analysis somewhat. 
Clearly, p_dyn(t) = π_fail(t), so that a solution of the above equations yields the probability 
of dynamic failure. Implicit in these calculations is the assumption that the hard 
deadline is very much smaller than the mission lifetime or the sojourn time in the various 
states. This is invariably the case in practice. 
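
A minimal numerical sketch of the balance equations (30a)-(30c) follows. It is an illustration under stated assumptions rather than the procedure used to produce the figures: the integration scheme (classical fourth-order Runge-Kutta) is chosen here for simplicity, the function name is hypothetical, and the deadline-averaged miss rate for state i, i.e. the integral of α(i,ξ) against F_d, is assumed to be supplied by the caller as `alpha(i)`.

```python
from math import comb, exp


def dynamic_failure_probability(n_start, n_min, mu_p, alpha, mission_time, steps=20000):
    """Integrate Eqs. (30a)-(30c) and return the failure-state probability.

    n_start: processors at start-up (N); n_min: smallest workable number (n);
    mu_p: processor failure rate; alpha(i): hard-deadline miss rate in state i.
    """
    states = list(range(n_min, n_start + 1))
    miss = {i: alpha(i) for i in states}

    def deriv(pi):
        d = {}
        for i in states:
            inflow = (i + 1) * mu_p * pi[i + 1] if i < n_start else 0.0
            d[i] = -(i * mu_p + miss[i]) * pi[i] + inflow
        d["fail"] = sum(miss[i] * pi[i] for i in states) + n_min * mu_p * pi[n_min]
        return d

    pi = {i: 0.0 for i in states}
    pi[n_start] = 1.0  # all N processors are functional at time zero
    pi["fail"] = 0.0
    h = mission_time / steps
    for _ in range(steps):
        k1 = deriv(pi)
        k2 = deriv({s: pi[s] + 0.5 * h * k1[s] for s in pi})
        k3 = deriv({s: pi[s] + 0.5 * h * k2[s] for s in pi})
        k4 = deriv({s: pi[s] + h * k3[s] for s in pi})
        pi = {s: pi[s] + h / 6.0 * (k1[s] + 2 * k2[s] + 2 * k3[s] + k4[s]) for s in pi}
    return pi["fail"]


if __name__ == "__main__":
    # Non-critical task (no deadline misses): only static failure remains.
    s = exp(-1e-4 * 10.0)  # survival probability of one processor over 10 hours
    static = sum(comb(4, k) * s**k * (1 - s) ** (4 - k) for k in range(2))
    print(dynamic_failure_probability(4, 2, 1e-4, lambda i: 0.0, 10.0), static)
```

When the miss rate is zero in every state, the result reduces to the static failure probability, i.e. the probability that fewer than n of the N processors survive the mission. For a deterministic hard deadline t_d, `alpha(i)` would be λ[1 - F_MMi(t_d)].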

To compute the mean finite cost over the mission lifetime, we first have to evaluate 
the distribution function of the response time, conditioned on the event that no hard 
deadline is violated. This distribution function for a c-processor system, denoted by 
F'_MMc, is given by: 

$$F'_{MMc}(t) = \frac{F_{MMc}(t)}{F_{MMc}(t_d)} \tag{31}$$

where t_d is the hard deadline and the function is only defined for arguments less than 
the associated hard deadline. Then, the mean cost is defined by: 

AT oo oo f 

M = E J 0 /„ /„ * M (r) iFlQ dL(l) (32) 

c—n 

The above expressions are used in the following section to obtain values for the probabil- 
ity of dynamic failure, and the mean cost as a function of the mission lifetime. 

5.2.2. Numerical Results and Discussion 

In what follows, it is assumed that the job arrival rate is λ = 100, and that the 
processor failure rate is μ_p = 10^-4. All time units are in hours. 

Figure 13 is the probability of dynamic failure when the task is non-critical, i.e. the 
deadline is at infinity. In such cases, the probability of dynamic failure reduces to being 
the probability of static failure, namely the probability of failure of hardware 
components to the point when the system utilization exceeds 100%. This is because 
catastrophic failure does not occur in this case until the system utilization is greater than 
unity. This explains the monotonic nature of the plot. 

Figure 13. Dependence of Probability of Dynamic Failure on Processor Number when 
Tasks are Non-Critical. 

Figure 14 shows the probability of dynamic failure when the task is critical with 
the deadlines for the respective curves noted in the figure. The number-power product 
is 2200. The curves that result for the failure plot form an inverted bell. The portions of 
the failure curve where the slope is positive can be explained as follows. When the 
number of processors increases, the response time distribution is skewed to the right. 
This in turn increases the probability of failing to meet the hard deadline to a greater 
extent than the static failure probability is reduced by the addition of further redun- 
dancy. The positive slope is the result of this tendency. When the hard deadline is 
smaller, the premium on speed is increased, and as a result, the trough of the curves 
moves to the left. 

In the region corresponding to the fewest processors, there tends to be a tradeoff 
between dynamic failure probability and the number of processors, leading to a negative 
slope for the failure curves. When there are few processors, the fault-tolerance is less and 
the probability of static failure is therefore greater. Since the total processing power of 
the processor bank is fixed, the few processors each have greater power, and the mean 
waiting time is low. This accounts for the very small nature of the non-static com- 
ponent of the failure probability. As the speeds of the individual processors are 
decreased, but their numbers increased commensurately, the static failure probability 
drops, but the probability of missing the hard deadline increases. In the area of the 
curve where the slope of the probability of failure is negative, the benefits accrued from 
adding redundancy outweigh the negative impact of the lowered individual speeds that 
result; elsewhere the reverse is the case. Notice that the curve corresponding to a 
deadline of 0.01 has no region of negative slope. This means that 2200 is a number-power 
product that is too small with respect to that hard deadline. 

Figure 14. Dependence of Probability of Dynamic Failure on Processor Number and 
Hard Deadlines. 


In Figure 15, the dependence of the probability of dynamic failure on the number- 
power product for a mission lifetime of 10 hours is considered. Each point on the curves 
represents the configuration yielding the lowest possible failure probability for the 
number-power product represented. The label of each point on this plot is the number of 
processors in the configuration for which this lowest failure probability is achieved. As 
the product increases, the optimal configuration tends to contain more processors: this 
also is due to the lowering of the non-static component of the dynamic failure probabil- 
ity when the product is increased. 

Naturally, the curves are monotonically non-increasing. They serve to show the 
marginal gain in maximum achievable reliability that is to be had on increasing the 
number-power product at each point for the class of systems under consideration. 
Notice the "elbows" in the plot. These occur when the minimum failure probability 
configuration changes, and are the result of a tradeoff between the static and non-static 
components of the failure probability. The p_dyn drops exponentially with an increase in 
the product as long as the static component is a small fraction of p_dyn. When the 
non-static component drops to sufficiently below the static component value, the optimal 
configuration changes, and the static component once again becomes negligible compared 
to the non-static component. This race continues indefinitely and is portrayed in Figure 
16. The discrete nature of the processors causes the elbows: if the number of processors 
were a continuous quantity, they would not appear. 

The probability of dynamic failure is used as a pass-fail test for control computers. 
Plots such as Figure 15 can be used in this connection. As an example, let the mission 
lifetime be 10 hours, and the specified probability of dynamic failure equal to 10^-7 over 
that period. Let the system parameters be those of the model in this section. Then, 
corresponding to each of the four deadlines considered, we can obtain graphically from 
Figure 15 the minimum number-power product that is required to satisfy the p_dyn 
specifications. These products are listed in Table 5. Any system that has a smaller 
number-power product must be rejected, no matter what its other credentials may be. 

Figure 15. Minimum Achievable Probability of Dynamic Failure. 

Figure 16. Race Between Static and Non-Static Components of Probability of 
Dynamic Failure. 

When this stage of the evaluation is complete, one has a set of acceptable 
configurations. Only after this point does the mean cost come into consideration. The mean 
costs associated with each of the points in Figure 15 are graphed in Figure 17, where the 
finite cost function, g, has been taken as equal to the response time for demonstrative 
purposes. The curves take the form of a sawtooth wave, with each upward transition 
occurring when the number of processors in the optimal configuration increases by one. 
Clearly, the greater the power of each processor, the smaller is the mean cost. 

In Figure 18 (A, B, and C), we show the effects on p_dyn of changing mission lifetime 
for various values of the hard deadline, t_d. In the light of the preceding discussion, these 
plots should be largely self-explanatory. It is worth pointing out, however, that as the 
lifetime increases, the optimal configuration contains a larger number of processors. The 
trough (around the optimal point) becomes shallower as one increases the mission life- 
time, until finally, it disappears to be replaced by a shallow trough one unit to the right. 
As the lifetime increases still further, the new trough deepens, then begins to become 
shallow. Whether or not the cycle continues depends upon the hard deadline: it will con- 
tinue so long as the number-power product is sufficiently large to cope with the hard 
deadline at the lifetimes used; the plot will rise monotonically to a failure probability of 
one if this is not the case (cf. Figure 14). 
Hard Deadline (hours)    Number-Power Product 

       0.010                    7165 
       0.025                    2126 
       0.050                    1496 
       0.075                    1180 

Required p_dyn = 10^-7 
Mission Lifetime = 10 hours 

Table 5. Minimum Number-Power Product for Various Hard Deadlines. 
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Figure 17. Mean Cost for Configurations of Figure 15. 
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Figure 18(a). Probability of Dynamic Failure for a Constant Number-Power Product. 
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Figure 18(c). 







In Figure 19 (A, B, and C), we show the associated Mean Costs per unit time of 
operation using the same cost function as was used in Figure 17. In all three curves, we 
may note the anomaly mentioned in Section 2: as the lifetime increases, and as the 
number of processors increases, there is a region over which the mean costs per hour 
actually drop. It is most pronounced in Figure 19A, where the probability of dynamic 
failure is close to unity under almost all configurations. This anomaly, of course, is due 
to the fact that the mean costs are computed on a response-time distribution that is 
conditioned on the system's not failing. Thus, on comparing Figures 19A, 19B, and 19C, 
we see that the system operating under the longest hard deadline, and therefore having 
greater reliability, exhibits a higher mean cost per hour in some configurations than its 
identical counterparts that operate under more difficult conditions. If this causes undue 
irritation, the anomaly can be made to vanish by redefining the finite cost function for a 
task i to be: 

$$\begin{cases} g_i(t) & \text{if } t \le t_d \\ g_i(t_d) & \text{if } t > t_d \end{cases} \tag{33}$$

introducing the following functions: 


?,(0 


<*.( 0 = £.9, 
/=» 

(34) 

ei\(o 

i=i 

(35) 

and defining the mean and variance costs to be: 

$$\text{Mean Approximate Cost (MAC)} = \int_0^\infty E\{\beta(t)\}\,dL(t) \tag{36}$$

$$\text{Variance Approximate Cost (VAC)} = \int_0^\infty \mathrm{Var}\{\beta(t)\}\,dL(t) \tag{37}$$
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Figure 19(a). Mean Costs for a Constant Number-Power Product. 
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Figure 19(b). 
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Figure 19(c). 
It is easy to see that MAC ≥ Mean Cost always, and that the approximate costs 
approach the accurate costs when the probability of dynamic failure is small. Indeed, the 
anomaly does not appear until the probability of dynamic failure is significant. Since the 
applications under consideration are all critical processes, p_dyn is always small for the 
accepted configurations, and the configurations that exhibit this anomaly will be rejected 
by the p_dyn pass-fail test, and their mean costs need never be computed. 

5.2.3. Extension 

The number-power tradeoff can easily be extended to make it very useful in the 
process of design. The reader will have noticed that in the cost functions with which we 
measure the goodness of controller performance, no account is taken of controller 
hardware cost. All that the cost functions express is the control overhead incurred in 
actually running the process. Indeed, we may regard the mean costs as average operating 
overheads. It is not easy directly to incorporate the hardware cost into the cost functions 
themselves. Instead, one may consider the set of hardware configurations available for a 
particular hardware cost outlay. Then, constant-cost plots can be drawn, showing the 
range of performance (in terms of probability of dynamic failure and average operating 
costs) that is available for any particular hardware cost outlay. From similar curves, one 
may arrive at the minimum finite average operating cost associated with a particular 
hardware cost given that specifications for the probability of dynamic failure are met. 

This approach can easily be illustrated with the number-power tradeoff considered 
here. When the hardware cost of a processor is proportional to its processing speed, the 
curves in Figure 13 become dynamic failure curves for a particular hardware cost. 
Curves such as Figure 18 can be used to identify the configurations that meet 
requirements for the probability of dynamic failure for given mission lifetimes, and the 
sensitivity of p_dyn to changes in mission lifetime. 

Also, one can study any other tradeoffs that may exist between the hardware cost 
or the number-power product and the minimum mean cost per lifetime associated with 
such a cost or product. 

The computer studied here is simple; however, it can be extended in some useful 
directions relatively easily. It is easy to take care of the case when the hardware cost or 
the finite cost function is a more complicated function of the processing speed. More 
complicated multiprocessors require a more involved analysis, but the basic ideas should 
now be clear. 

5.3. Synchronization 

Figure 20 is a schematic showing the handling of data as it enters the system 
through the sensors, and leaves it (in a figurative sense) at the actuators. Synchroniza- 
tion and fault-masking are integral parts of any fault-tolerant distributed system. 

When a multiplicity of processors executes code in parallel, care must be taken to 
keep them reasonably in step. Therefore, the issue of synchronization is focal to all 
methods of forward error recovery. There are two basic methods of synchronization: 

(1). Each processor has an ultra-precise clock. When the computer is switched on, the 
clocks are synchronized. If the clocks are sufficiently precise, the processors will 
continue to run in lock-step for an appreciable period. Unfortunately, such highly 
precise clocks are extremely expensive to build, and unsuited to incorporation in 
computer circuits. (For a description of highly precise clocks, see [19].) Clocks 
that are generally used in computer circuits drift too rapidly for this method to 
be employed in practice. We shall not consider this method any further. 

(2). The synchronization is carried out mutually. There is no single component whose 
functioning is critical to the security of the whole system. One may choose to 
synchronize the processor clocks, or the processors themselves at pre-defined 
boundaries of software execution. In the first case, one has a system operating more or 
less in lock-step, such as the FTMP system [2]. Both methods of synchronization 
are based on the same basic concepts; the only difference is the frequency with 
which synchronization is carried out. 16 The notion of virtual time sources now 
arises naturally. These are not necessarily clocks in the traditional sense; they 
mark the points at which an individual processor performs synchronization. It is 
convenient to view them as virtual clocks, whose transitions represent either clock 
"ticks" or execution of a stretch of code up to a pre-specified boundary. In the 
sequel, unless it is otherwise stated, the term "clock" is used to mean "virtual 
clock". 

Figure 20. Functional Block Diagram of a Real-Time System. 

Ricky W. Butler at NASA Langley Research Center has made a major technical contribution to this 
Section. 

When synchronization is mutual, no "absolute" underlying time-source exists, only 
a set of time-sources whose relative behavior must be kept in step. The synchronizer 
(which may or may not be a physical part of the processor and which may be imple- 
mented either in hardware or in software) must therefore in each case have a perception 
of the state of the other time-sources. This perception may or may not be identical to 
that of the other synchronizers: if faulty modules necessarily behave consistently with 
respect to all synchronizers, it is identical; otherwise it need not be so. 

16 An important corollary of this is that the maximum clock drift rates that can be tolerated decrease 
with a decrease in the frequency of synchronization. 
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The synchronization process contributes to the system overhead in two ways. 
Firstly, there is the overhead imposed by the synchronization task itself. Secondly, the 
task-to-task communication overhead is proportional to the degree of synchronization 
achieved. 

If hardware synchronization with phase-locked clocks is employed, the synchroniza- 
tion overhead can be reduced to vanishing point. If software synchronization is used, 
the overhead is significant. Both approaches to synchronization will be considered in 
succeeding sections. First, however, we will consider the second component. 

Because of severe timing constraints, real-time systems do not generally use sophis- 
ticated mechanisms for task-to-task communication. Typically, data are transmitted 
from one task to another via timing rules agreed in advance. As a result, the receiving 
task has to wait for a time equal to the sum of the maximum transmission time and the 
maximum possible clock skew before it can read the data. Where synchronization is car- 
ried out in software and depends on the transmission of timing data on regular data 
channels, this transmission delay feeds back to increase the synchronization delay itself. 
We will consider this matter in detail in the sequel. 

Synchronization can be implemented in either hardware or software. In what fol- 
lows, we present a detailed discussion on each of these two implementations. 

5.3.1. Hardware Synchronization 

In this section, we consider synchronization by phase locking. Phase-locked clocks 
were first used to ensure that the processors of FTMP [1] operated in lock step. We 
consider a total of N clocks to be synchronized in the face of up to m faulty clocks. The 
clocks are at the nodes of a completely connected graph. The basic theory behind their 
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operation is simple. In Figure 21, we provide a schematic diagram of an individual 
clock. Each clock consists of a receiver which monitors the clock pulses of the N - 1 other 
clocks in the arrangement, and these are used to generate a reference signal. By compar- 
ing this reference with its own pulse, the receiving clock computes an estimate of its own 
phase error. This estimated phase error is then put into an appropriate filter, and the 
output of the filter controls the clock oscillator’s frequency. By thus controlling the fre- 
quency of the individual clocks, they can be kept in phase-lock and therefore synchron- 
ized for as long as the initial phase error is below a prescribed bound, i.e. for as long as 
the clocks started reasonably in step and their drifts are sufficiently low. A discussion of 
clock stability is provided in [21]. 

The arrangement for N = 4, m = 1 is, to our knowledge, the only phase-locked clock 
constructed and fully analyzed [20]. Unfortunately, when one attempts to increase m 
without care, synchronization can be lost due to the presence of malicious faults. In this 
section, we show how to design phase-locked clocks to tolerate a given arbitrary number 
of malicious failures. Our work is a generalization of the original design [20] which can 
tolerate at most one failed clock. 

5.3.1.1. Notation and Definitions 

The following notation and definitions are used in this section. 

Definition 1: If the overall system of clocks is properly synchronized, all individual 
non-faulty clocks must agree closely with each other. A well-synchronized system thus 
has global clock cycles. Global clock cycle i is the interval between the i-th tick of the 
fastest non-faulty clock (i.e. the non-faulty clock that has its i-th tick before that of all 
the other non-faulty clocks) and the (i+1)-th tick of the fastest non-faulty clock. For 
brevity, we shall denote global clock cycle i by gcc_i. 
Definition 2: Each of the clocks "sees", through its receiving circuitry, the ticks of the 
other clocks. These ticks, together with the receiving clock's own tick, can be totally 
ordered in any gcc_i by the relation "prior or equal to". Such an ordered set, called a 
scenario, for clock a in gcc_i is denoted by S_a^i. We shall frequently drop the superscript 
for convenience: where this is done, it will be understood that we are talking about some 
gcc_i. 

If a non-faulty clock c does not receive a tick from clock d within a given timeout 
period in any global clock cycle, the tick for d is arbitrarily assumed by c to be at the 
end of that timeout period. The scenario of every non-faulty clock therefore has exactly 
N elements. 
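
The construction of a scenario, including the timeout rule, can be sketched as follows. This is only an illustration: the function and argument names are hypothetical, tick times are taken to be plain numbers, and breaking ties by clock name is an assumption (the text only requires a total order under "prior or equal to").

```python
def build_scenario(own_clock, own_tick, received_ticks, all_clocks, timeout_end):
    """Return the clocks ordered by tick time as seen by own_clock in one cycle.

    received_ticks maps each other clock to the time its tick was observed.
    A tick that is missing, or later than the timeout, is placed at timeout_end.
    """
    times = {}
    for clock in all_clocks:
        if clock == own_clock:
            times[clock] = own_tick
        else:
            observed = received_ticks.get(clock, timeout_end)
            times[clock] = min(observed, timeout_end)
    # Ties are broken by clock name purely to make the order deterministic.
    return sorted(all_clocks, key=lambda c: (times[c], c))


if __name__ == "__main__":
    # Clock d never ticks within the timeout, so it is placed last.
    print(build_scenario("a", 1.0, {"b": 0.5, "c": 2.0}, ["a", "b", "c", "d"], 3.0))
```

Because every silent clock is assigned the end of the timeout period, the returned scenario always has exactly N elements.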

Definition 3: If clock a has clock b as its reference in some gcc_i, it is said to trigger on b 
in that gcc_i. 

Definition 4: Given the various triggers, we can draw a directed graph with the clocks 
as the vertices, and the directed arcs reflecting the relationship "triggers" in some gcc_i. 
Such a graph is called the trigger graph. For example, in Figure 22, a triggers b and c, 
and is itself triggered by d, while d is triggered by b. A clique of clocks is a component 
of the trigger graph. In Figure 22, there are two cliques: {a,b,c,d} and {e,f,g}. 
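
Finding the cliques of a trigger graph amounts to finding its components. The sketch below assumes that "component" means a weakly connected component (arc direction ignored); that reading, the function name, and the arcs among e, f and g are assumptions, since the text lists only the arcs among a, b, c and d.

```python
def cliques(clocks, triggers):
    """Return the components of the trigger graph as a list of sets.

    triggers is an iterable of (x, y) pairs meaning 'x triggers y'.
    """
    neighbours = {c: set() for c in clocks}
    for x, y in triggers:
        neighbours[x].add(y)
        neighbours[y].add(x)  # direction is ignored when forming components

    seen, result = set(), []
    for start in clocks:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(component)
    return result


if __name__ == "__main__":
    arcs = [("a", "b"), ("a", "c"), ("d", "a"), ("b", "d"),  # from the text
            ("e", "f"), ("f", "g"), ("g", "e")]              # hypothetical arcs
    print(cliques("abcdefg", arcs))
```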

Notation: G and NG are the set of clocks and non-faulty clocks, respectively, in the 
system. There are N clocks in all, and up to m failures must be sustained. 

Definition 6: A partition of G is defined as a set P = {G_1, G_2}, where G_1 and G_2 are 
subsets of G with the following properties: 

(i) G = G_1 ∪ G_2 

(ii) G_1 ∩ G_2 ∩ NG = ∅ 

(iii) G_i ∩ NG ≠ ∅, i = 1, 2. 

From (i), each clock must belong to at least one of G_1 and G_2. From (ii), only 
faulty clocks may belong to both G_1 and G_2. From (iii), there must be at least one 
non-faulty clock in each of G_1 and G_2. 
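
The three properties of Definition 6 translate directly into set operations. The following sketch is an illustration with a hypothetical function name; clocks are represented by arbitrary hashable labels.

```python
def is_partition(g1, g2, all_clocks, non_faulty):
    """Check properties (i)-(iii) of Definition 6 for the pair {g1, g2}."""
    g1, g2 = set(g1), set(g2)
    all_clocks, non_faulty = set(all_clocks), set(non_faulty)
    covers_all = (g1 | g2) == all_clocks                  # property (i)
    only_faulty_shared = not (g1 & g2 & non_faulty)       # property (ii)
    each_has_good = bool(g1 & non_faulty) and bool(g2 & non_faulty)  # property (iii)
    return covers_all and only_faulty_shared and each_has_good


if __name__ == "__main__":
    # Four clocks, of which d is faulty and may therefore sit in both subsets.
    print(is_partition({"a", "d"}, {"b", "c", "d"}, "abcd", "abc"))
```

Note that this is not a partition in the usual set-theoretic sense, since a faulty clock is allowed to appear in both subsets.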

Definition 7: A clock a is said to be faster than a clock b in scenario S if a precedes b in 
S. In a partition P = {G_1, G_2}, G_1 is said to be faster than G_2 if every non-faulty clock in 
G_1 is faster than every non-faulty clock in G_2. 

Notation: Given a partition P = {G_1, G_2}, NG_1 and NG_2 are the non-faulty clocks in G_1 
and G_2, respectively. By definition 6, neither NG_1 nor NG_2 can be empty and 
NG_1 ∩ NG_2 = ∅. 

Definition 8: Cliques A and B (of clocks) are said to be non-overlapping if the non- 
faulty clocks of A are either all faster than those of B, or vice versa. 

Notation: Denote the position of a clock $c$ in its own scenario $S_c^i$ in gcc $i$ by $p_c^i$. Again,
we shall frequently drop the superscript for convenience. The reference signal (i.e. the
trigger) is a function of $N$ and of $p_c$. It is denoted by $f_{p_c}(N)$. By this, we mean that
clock $c$ triggers on the $f_{p_c}(N)$-th signal in $S_c$, not counting itself.

For the system to operate satisfactorily, all the non-faulty clocks must have their
ticks close together. Also, they should tell good time, i.e. the length of every global
clock cycle should be about the length of an ideal (or absolute time) clock's inter-tick
interval. These requirements dictate the following two conditions of correctness, C1 and
C2.

Definition 9: Each of the following conditions of correctness must be satisfied in gcc $i$ if
the system is to be correctly operating in every gcc $i$.

C1. For all partitions $P=\{G_1,G_2\}$ of the set of clocks $G$, in which the non-faulty clocks
in $G_1$ are all faster than those in $G_2$, each of the following (K1 and K2) must apply:

K1. If, in gcc $i$, all clocks in $NG_1$ trigger on clocks in $G_1$, then there is at least one
clock in $NG_2$ that triggers on a clock in $G_1$. Furthermore, if no clock in $NG_2$
triggers on a clock in $NG_1$, at least one clock $k \in NG_2$ must trigger on a faulty
clock $h \in G_1$ such that in the scenario $S_k$, there is at least one clock $r \in NG_1$
that is slower than the clock $h$.

K2. If, in gcc $i$, all clocks in $NG_2$ trigger on clocks in $G_2$, then there is at least one
clock in $NG_1$ that triggers on a clock in $G_2$. Furthermore, if no clock in $NG_1$
triggers on a clock in $NG_2$, at least one clock $k \in NG_1$ must trigger on a faulty
clock $h \in G_2$ such that in $S_k$ there is at least one clock $r \in NG_2$ that is faster
than $h$.

C2. If a non-faulty clock $x$ triggers on a faulty clock $y$, then there must exist non-faulty
clocks $z_1$ and $z_2$ such that $z_1$ is faster than or equal to $y$, and $y$ is faster than or
equal to $z_2$. Either $z_1$ or $z_2$ may be $x$ itself.

Intuitively, we may regard C1 as preventing the formation of non-overlapping
cliques (which would obviously destroy synchrony) and C2 as ensuring that the system
keeps good time, i.e. that each global clock cycle is close to being the clock cycle of
an ideal clock.

Finally, we assume that the transmission of clock signals through the system takes
negligible time. This ensures that all non-faulty clocks are seen by all clocks in the same
mutual order.

5.3.1.2. Malicious Failure and Synchronization

The phase-locked clock system for $N=4$, $m=1$ is simple enough to be proved
correct by an exhaustive enumeration of all eventualities. It is, to our knowledge, the
only phase-locked clock actually constructed [20].

Here, the reference used is the second incoming pulse (in temporal order), i.e. the 
median pulse. Such a clock is proof against the malice of a single faulty clock. To give 
the reader a feeling for why this is so, and to enhance his intuition about malicious 
failure, we provide below a simple explanation. 

Call the four clocks a, b, c, and d. Let d be the maliciously faulty clock. Because d
is malicious, it may provide different timing signals (i.e. lie) to different receivers. Since
the non-faulty clocks by definition send their ticks at the same moment (or do not lie) to
all the other receiving clocks, the mutual ordering of the non-faulty clocks within every
scenario is the same for all non-faulty clocks. That is to say, if clock b sees clock a as
faster than clock c in some gcc $i$ (i.e. clock a sends its $i$-th tick to b before clock c does so),
then a will appear faster than c to both the other non-faulty clocks in the system, i.e. to
a and c, in that gcc $i$. Clock d, however, may appear in different positions in the scenarios of
the non-faulty clocks since it is malicious. One way of proving that a four-clock
arrangement works despite d's being malicious is to enumerate all possible actions of d
and show that the system still continues to satisfy the conditions of correctness.

Assume without loss of generality that a is prior or equal to b, which in turn is prior
or equal to c, in some gcc $i$. Consider a sample set of scenarios for our four-clock example.
The triggering clock (the second pulse in the scenario, not counting the clock's own) is
noted alongside each scenario.

$S_a$: a < b < c < d    (a triggers on c)
$S_b$: a < d < b < c    (b triggers on d)
$S_c$: a < b < d < c    (c triggers on b)

The scenario $S_d$ is irrelevant, since d is faulty.

Notice first that the position of the faulty clock d changes relative to the others, 
while the mutual ordering of the non-faulty clocks remains unchanged, as indeed it 
should. 

It is easy to see that both conditions of correctness will be satisfied, and that the
clock will operate correctly if the above scenario holds. It is not difficult to write down
all the $4^3=64$ possible scenarios (with the ordering of the non-faulty clocks fixed as
above) that are made possible by the arbitrary positioning of d, and to convince oneself
that, for all possible scenarios, C1 and C2 are satisfied.
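
The enumeration is small enough to mechanize. The following sketch, offered only as an illustration, walks through all 64 placements of the faulty clock d and checks condition C2 for each; it does not attempt C1, which needs the partition bookkeeping of Definition 9. It assumes the trigger is the second pulse in a scenario not counting the clock's own, as stated above; all function names are hypothetical.

```python
from itertools import product

GOOD = ["a", "b", "c"]   # non-faulty clocks, fastest first (ordering fixed as in the text)
FAULTY = "d"


def scenario(d_position):
    """Scenario seen by one clock: the fixed order a, b, c with d inserted at d_position."""
    order = GOOD.copy()
    order.insert(d_position, FAULTY)
    return order


def trigger(order, me):
    """Second pulse in the scenario, not counting the observing clock itself."""
    others = [c for c in order if c != me]
    return others[1]


def c2_holds(order, me):
    """C2: a faulty reference must have a non-faulty clock on each side of it."""
    ref = trigger(order, me)
    if ref != FAULTY:
        return True
    i = order.index(ref)
    faster = any(c in GOOD for c in order[:i])
    slower = any(c in GOOD for c in order[i + 1:])
    return faster and slower


def count_c2_failures():
    """Return (number of scenario sets examined, number violating C2)."""
    total = failures = 0
    for positions in product(range(4), repeat=3):   # d's position in S_a, S_b, S_c
        total += 1
        ok = all(c2_holds(scenario(pos), me) for me, pos in zip(GOOD, positions))
        failures += 0 if ok else 1
    return total, failures
```

Under these assumptions `count_c2_failures()` examines 64 scenario sets and finds none that violates C2, in agreement with the claim above.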

Unfortunately, if we try to allow for $m=2,3,\ldots$ by expanding the system arbitrarily
without sufficient care, the conditions of correctness can be violated. In fact, it is even
possible for a system to contain an arbitrarily large number of clocks, and still to be
vulnerable to just two malicious failures.

To see this, consider the following example. Let us choose, for each clock $y$ in the
system, $f_{p_y}(N)$ as the median clock signal in the scenario, not counting clock $y$. If $N$ is
odd (and there is thus an even number of "other" clocks), choose the slower of the two
middle clocks. Then, $f_{p_y}(N)$ is only a function of $N$. We therefore drop the subscript for
this example. Choosing the median signal is certainly good intuition.

Let there be only two faulty clocks, $x_1$ and $x_2$, and $n=N-2$ non-faulty clocks
$a_1, \ldots, a_n$.

Case 1: $N \geq 7$. Consider some gcc $i$. Assume that $a_k$ is faster than $a_l$ in gcc $i$ if $k<l$. Now,
let $x_1$ and $x_2$ present themselves as the fastest two clocks to $a_1, \ldots, a_p$, and as the slowest
two clocks to the other non-faulty clocks, i.e. $a_{p+1}, \ldots, a_n$, where $p=\lceil n/2 \rceil$. Then,
the set of scenarios can be represented as in Figure 23.

Recalling that a clock triggers on the $f(N)$-th tick in its scenario not counting itself,
we can draw the trigger graph as in Figure 24. It follows that $\{a_1, \ldots, a_p\}$ and
$\{a_{p+1}, \ldots, a_n\}$ will be two non-overlapping cliques, no matter how large $n$ may be. It is
easy to work out the case for $N=7$ to convince oneself of this fact.
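
The construction of Case 1 can be replayed for any cluster size. The sketch below is a minimal illustration under the assumptions stated in the text: the median rule (slower of the two middle clocks when the number of other clocks is even), two faulty clocks `x1` and `x2`, and $p=\lceil n/2 \rceil$. The function name `median_triggers` and the clock labels are hypothetical.

```python
def median_triggers(N):
    """Return (p, triggers) for the two-faulty-clock attack on the median rule."""
    n = N - 2                          # number of non-faulty clocks a1..an
    p = -(-n // 2)                     # ceil(n / 2)
    good = [f"a{k}" for k in range(1, n + 1)]
    f = (N - 1) // 2 + 1               # median of the N-1 other clocks (slower middle one)
    triggers = {}
    for index, clock in enumerate(good):
        others = [g for g in good if g != clock]
        if index < p:
            seen = ["x1", "x2"] + others    # faulty clocks pose as the two fastest
        else:
            seen = others + ["x1", "x2"]    # faulty clocks pose as the two slowest
        triggers[clock] = seen[f - 1]
    return p, triggers
```

For `N = 7` this gives `p = 3`, with a1, a2, a3 triggering only among themselves and a4, a5 triggering only on each other, which is the pair of cliques described above.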

Case 2: N<7. This is trivial, and showing that the system is incapable of sustaining 
even two maliciously faulty clocks is left to the reader. 

This has been a cautionary tale of the unbridled use of intuition in designing 
phase-locked clocks. Assured now that a more careful approach is needed, we turn in 
the following section to showing how to expand phase-locked clocks. 

5.3.1.3. Main Result

Our job is to (i) find the lower bound, $N$, on the size of a system of clocks that
must sustain up to $m$ maliciously faulty clocks, and (ii) find the functions $f_x(N)$ for
$x=1,\ldots,N$.

We begin with the following two lemmas.


Lemma 1: Condition C2 is satisfied for all partitions $P=\{G_1,G_2\}$ if and only if there
exist functions $f_x(N)$ for $x=1,\ldots,N$, such that

$$\min\{m, x-1\} < f_x(N) < \max\{N-m, x\} \tag{38}$$

Proof: Let $k$ be a non-faulty clock such that $p_k = x$. We must show that Eq. (38) holds
for all $x$ for which $p_k$ is defined iff condition C2 holds.

Suppose that there exist functions $f_x(N)$ for $x=1,\ldots,N$ satisfying Eq. (38). This
implies $\min\{m, x-1\}+1 < \max\{N-m, x\}$ for all $x \in \{1,2,\ldots,N\}$, leading to $N>2m+1$.
Hence, it is sufficient to consider the following three cases:

(i) $x \le m$:

Clearly, $\max\{N-m, x\} = N-m$, $\min\{m, x-1\} = x-1$ and therefore
$x-1 < f_x(N) < N-m$. If the reference clock is non-faulty, we have nothing
to prove. If it is faulty, then since there are at most $m$ faulty clocks,
there must be at least one non-faulty clock slower than the reference
clock. Also, from the left half of the inequality, $f_x(N) > x-1$, and since clock
$k$ is non-faulty, there is a non-faulty clock (i.e. $k$ itself) faster than the
reference clock. So, C2 is satisfied.

(ii) $N-m \ge x > m$:

$\min\{m, x-1\} = m$, $\max\{N-m, x\} = N-m$ and therefore
$m+1 \le f_x(N) \le N-m-1$. Since at most $m$ faulty clocks exist, if the
reference clock in $S_k$ were faulty, it must appear in $S_k$ as slower than at
least one non-faulty clock (the right half of the inequality), and faster
than at least one non-faulty clock (the left half of the inequality), and
C2 is satisfied.

(iii) $N \ge x > N-m$:

$\min\{m, x-1\} = m$, $\max\{N-m, x\} = x$ and $m+1 \le f_x(N) \le x-1$. As with
the previous cases, there must appear in $S_k$ at least one non-faulty clock
that is faster than the reference clock, if the reference clock is faulty.
Also, since $k$ is non-faulty, and appears in the $x$-th (i.e. $p_k$-th) position,
there is at least one non-faulty clock, in particular clock $k$, that is
slower than the reference clock in $S_k$, thus satisfying C2.

Conversely, suppose $f_x(N) \le \min\{m, x-1\}$. Then, C2 is violated when faulty clocks
appear in suitably chosen positions of $S_k$. Similarly, if $f_x(N) \ge \max\{N-m, x\}$, C2 is
violated when faulty clocks appear in suitably chosen positions of $S_k$. Q.E.D.

Lemma 2: If all clocks in $NG_1$ trigger only on clocks in $G_1$ (where the notation is the
same as in Definition 9), then the following are equivalent:

(i) $q_1 \ge \min_{k \in NG_2} f_{p_k}(N)$, where $q_1$ is the number of non-faulty clocks in $G_1$.

(ii) K1 is satisfied.

Proof:

(i) implies (ii): If (i) holds, then it is easy to see that no matter how the up to $m$
faulty clocks in $G$ arrange themselves, K1 is satisfied.

(ii) implies (i): Suppose, to the contrary, that $q_1 < \min_{k \in NG_2} f_{p_k}(N)$. Consider the
nonempty set $L = \{y : y \in NG_2 \text{ and } f_{p_y}(N) = \min_{k \in NG_2} f_{p_k}(N)\}$. Assume that there are
$i \le m$ faulty clocks in $G_1$. Since the faulty clocks may present themselves in any position in
any scenario, consider the case where they present themselves in the scenario of every
$y \in L$ in the $q_1+1, \ldots, q_1+i$ positions. Then, there is no non-faulty clock in $G_1$ that is
slower than the reference clock of any clock in $NG_2$, a contradiction. Q.E.D.


The two theorems below yield the main result of this section.

Theorem 1: To ensure that, despite up to $m$ malicious failures, the conditions of
correctness are satisfied, the system must have $N \ge 3m+1$ clocks.


Proof: We will only consider here the case of partitions $P=\{G_1,G_2\}$ in which all clocks
in $NG_1$ trigger on clocks in $G_1$. The other case (i.e. K2) can similarly be dealt with.

Let there be $q_1$ and $q_2$ clocks respectively in $NG_1$ and $NG_2$. Let
$M = \{y : y \in NG_1 \text{ and } f_{p_y}(N) = \max_{k \in NG_1} f_{p_k}(N)\}$, and let $i$ faulty clocks
belong to $G_1$. Then, the assumption that all non-faulty clocks in $G_1$ trigger on clocks in
$G_1$ is equivalent to saying that one of the following Eqs. (39) and (40) must apply:

$$q_1 + i \ge \max_{k \in NG_1} f_{p_k}(N) + 1 = f_{p_y}(N) + 1 \tag{39}$$

which applies if there exists at least one $p_y$, $y \in M$, such that $p_y \le f_{p_y}(N)$. The addition of
1 follows from the fact that clock $y$ does not count itself when counting to $f_{p_y}(N)$. If
$p_y > f_{p_y}(N)$ for all $y \in M$, the following Eq. (40) applies:

$$q_1 + i \ge \max_{k \in NG_1} f_{p_k}(N) \tag{40}$$


First consider the case where Eq. (39) applies. The condition that C1 (more specifically,
K1) holds implies, from Lemma 2, that

$$q_1 \ge \min_{k \in NG_2} f_{p_k}(N) \tag{41}$$

Since this must be true for all partitions of $G$, we have for all $q_1 \in \{1,\ldots,N-i-1\}$:

$$q_1 + i \ge \max_{k \in NG_1} f_{p_k}(N) + 1 \;\Longrightarrow\; q_1 \ge \min_{k \in NG_2} f_{p_k}(N).$$

Hence, K1 can be written as:

For all $q_1 \in \{1,\ldots,N-i-1\}$:

$$q_1 \ge \max_{k \in NG_1} f_{p_k}(N) - i + 1 \;\Longrightarrow\; q_1 \ge \min_{k \in NG_2} f_{p_k}(N).$$


In particular, this is true for $q_1 = \max_{k \in NG_1} f_{p_k}(N) - i + 1$. Thus,

$$\max_{k \in NG_1} f_{p_k}(N) - i + 1 \ge \min_{k \in NG_2} f_{p_k}(N)$$

or

$$\max_{k \in NG_1} f_{p_k}(N) - \min_{k \in NG_2} f_{p_k}(N) \ge i-1 \tag{42}$$

Recall that this is true if Eq. (39) applies. Similarly, if Eq. (40) applies, we have
from an identical argument,

$$\max_{k \in NG_1} f_{p_k}(N) - \min_{k \in NG_2} f_{p_k}(N) \ge i \tag{43}$$

Eqs. (39)-(43) must hold for all possible $i$. Since there are at most $m$ faulty clocks,
we must have:

$$\max_{k \in NG_1} f_{p_k}(N) - \min_{k \in NG_2} f_{p_k}(N) \ge m-1 \tag{42'}$$

if Eq. (39) applies, and

$$\max_{k \in NG_1} f_{p_k}(N) - \min_{k \in NG_2} f_{p_k}(N) \ge m \tag{43'}$$

if Eq. (40) applies.

We first consider the case where Eq. (39) applies. We claim that it implies that
$N>3m$.

To see why, let $y$ be the slowest clock in $M$ and $z$ the slowest clock in $L$ (with $L$
defined as in Lemma 2). Then, due to Lemma 1 and Eq. (42') the following inequality
must hold:

$$\max\{N-m, p_y\} > \max_{k \in NG_1} f_{p_k}(N) \ge m-1+\min_{k \in NG_2} f_{p_k}(N) > m-1+\min\{m, p_z-1\} \tag{44}$$

The up to $m$ faulty clocks in the system can arrange themselves in any order. In
particular, they can so order themselves in $S_y$ that $p_y<N-m$, and so order themselves in $S_z$
that $p_z>m$. Since Eq. (44) must hold always, no matter what the faulty clocks do, we
must have:

$$N-m > \max_{k \in NG_1} f_{p_k}(N) \ge m-1+\min_{k \in NG_2} f_{p_k}(N) \ge (m-1)+(m+1) \tag{45}$$

from which we arrive at the inequality

$$N>3m \tag{46}$$

Recall that this applies whenever Eq. (39) holds. If, instead, Eq. (40) applies, we
can similarly show that

$$N>3m+1 \tag{47}$$

Since we seek the smallest $N$ to satisfy the conditions of correctness, we are done if we
can show that there exist functions $f_x(N)$ such that Eq. (39) always applies (and therefore
Eq. (40) never applies), and for which Eq. (45) is satisfied. But, we can always construct
$f_x(N)$ to (i) be monotonically non-increasing functions of $x$ and (ii) satisfy Eq. (45): an
example of such a construction is provided in the statement of Theorem 2 below. Hence
Eq. (39) always applies, and $N \ge 3m+1$ is the necessary condition.

The case when all clocks in $NG_2$ trigger on clocks in $G_2$ can be similarly treated.

Q.E.D.

Theorem 2: If $N \ge 3m+1$ and

$$f_x(N) = \begin{cases} 2m & \text{if } x<N-m \\ m+1 & \text{if } x \ge N-m \end{cases}$$

then the conditions of correctness are satisfied.

Proof: $f_x(N)$ as defined here satisfies Lemmas 1 and 2 and is monotonically
non-increasing in $x$. Clearly, C2 holds. Also, it is easy to see that if $N>3m$ and Eq. (39)
implies Eq. (41), then case K1 in Definition 9 will hold. We therefore only have to show
that the definition of $f_x(N)$ as given above satisfies Eq. (41) if Eq. (39) is satisfied. This
can easily be verified by a direct substitution.

Case K2 can be similarly seen to hold. Q.E.D.
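
Part of this claim is easy to check numerically. The sketch below, given only as an illustration, tests whether the trigger function of Theorem 2 satisfies the strict inequality of Eq. (38) and is monotonically non-increasing. It assumes the piecewise boundary $x \ge N-m$ for the value $m+1$; the function names are hypothetical.

```python
def f(x, N, m):
    """Trigger position from Theorem 2 (boundary x >= N - m assumed for the second branch)."""
    return 2 * m if x < N - m else m + 1


def satisfies_eq38(N, m):
    """True if min{m, x-1} < f_x(N) < max{N-m, x} for every position x = 1..N."""
    return all(min(m, x - 1) < f(x, N, m) < max(N - m, x) for x in range(1, N + 1))


def non_increasing(N, m):
    """True if f_x(N) never increases as x grows."""
    values = [f(x, N, m) for x in range(1, N + 1)]
    return all(a >= b for a, b in zip(values, values[1:]))
```

For example, `satisfies_eq38(4, 1)` holds, whereas `satisfies_eq38(3, 1)` does not: with $N=3m$ the upper bound $N-m$ coincides with $2m$ and the strict inequality fails.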

It should be noted that the set of functions $f_x(N)$ is not always unique. From the
proofs of Theorems 1 and 2, the following inequalities are sufficient:

(i) $m+1 \le f_x(N) \le N-m-1$ for all $x = 1,\ldots,N$,

(ii) $f_x(N) \ge m-1+f_{N-m}(N)$ for all $x \le m+1$,

(iii) $f_{N-m}(N) \le f_x(N) \le f_{m+1}(N)$ for $N-m \ge x \ge m+1$,

(iv) $f_i(N) \ge f_j(N)$ if $i<j$.

The intervals $x \ge N-m$ and $x \le m+1$ arise from the up to $m$ faulty clocks in the
system. All that we can tell about the fastest non-faulty clock $g$ in the system (this clock
must have the maximum value of $f_x(N)$) in clock $g$'s scenario is that it is in the first $m+1$
clocks in that scenario. Similarly, all that we can tell about the position of the slowest
non-faulty clock $s$ in the system (which must have the minimum value of $f_x(N)$) is that it
occupies a place in the last $m+1$ clocks. This leads at once to the intervals $x \ge N-m$ and
$x \le m+1$.

It is interesting to note that if conditions C1 and C2 are both satisfied, and the
functions $f_x(N)$ are monotonically non-increasing in $x$, then a stronger condition than C2
automatically holds.


Corollary: If the conditions of correctness are satisfied, with the $f_x(N)$ being defined as
monotonically non-increasing functions of $x$, then the following condition C3 holds.

C3. Every non-faulty clock necessarily triggers on either a non-faulty clock, or a faulty
clock that is sandwiched between the other non-faulty clocks.


Proof: Now, C3 follows immediately from C2 for all but the fastest and slowest
non-faulty clocks.

Consider the fastest non-faulty clock. In the course of proving Theorem 1, it was
established that $N-m > f_x(N) > m$ for all $x=1,\ldots,N$ and that
$\max_{k \in G} f_{p_k}(N) - \min_{k \in G} f_{p_k}(N) \ge m-1$, leading to $2m$ as the smallest value for
$\max_{k \in NG} f_{p_k}(N)$, where $NG \subseteq G$ is the set of non-faulty clocks. From the monotonic
nature of the $f_{p_k}(N)$, the trigger for the fastest non-faulty clock must lie in the interval
$2m+1, \ldots, N-m$. But, since $N \ge 3m+1$, any faulty clock in this interval must be
sandwiched between non-faulty clocks.


The proof for the slowest non-faulty clock is similar. Q.E.D. 


Remark 1: Synchronization Overhead 

In the case of a phase-locked clock, there is some time overhead due to the oscilla- 
tions that are possible as a result of malicious behavior. However, these are minimal 
when good crystal clocks are used, and so it is reasonable to treat the overhead of 
hardware synchronization as negligible. Also, the clock skew is very small/negligible in a
well-designed phase-locked clock.


Remark 2: An Alternative Design 

The only other hardware arrangement that we are aware of for keeping synchronization
in the face of malicious behavior is the multi-stage synchronizer arrangement proposed
by Davies and Wakerly [22]. The idea is shown in Figure 25. It consists of $m$
stages of $N$ synchronizers each. The system works on the principle that, with this
redundancy, there must be at least one level of synchronizers that assures proper
synchronization in the presence of malicious faults. An informal proof is provided in [22].

This arrangement results in a proliferation of hardware. As may readily be verified, the
total number of devices (processors and synchronizers) in the cluster is $2m^2+3m+1$. The
total number of I/O ports required is given by $8m^3+16m^2+10m+2$. The potential
enormity of the above numbers should be driven home by the consideration that in order to
maximize returns from redundancy, the individual modules must be isolated from one
another as much as possible. This dictates that power supplies must also be replicated
in large numbers, and that the benefits of large-scale integration cannot be brought to
bear on the issue: individual synchronizers must be on separate devices, perhaps even
on separate cards. Otherwise, correlated and common-cause failures could wipe out
reliability gains made by device redundancy.

Compared with the gargantuan nature of the redundancy required by the Davies
and Wakerly approach, the $N=3m+1$ requirement of phase-locked clocks represents an
extremely elegant hardware solution to the problem of synchronization in the presence of
malicious faults.

Figure 25. Davies and Wakerly's Multistage Synchronizer

5.3.2. Software Synchronization 

The use of decentralized algorithms for synchronization offers an alternative to the 
hardware methods described above. Such algorithms enable a system consisting of many 
processors with their own clocks to operate in close synchrony. The degree of synchroni- 
zation obtained by these algorithms depends primarily on the performance of the com- 
munications system, the precision of the clocks, and the frequency of resynchronization. 
The task-to-task communications system's one-way message time is at least $B+\delta$, where
$B$ is the maximum transmission time and $\delta$ is the maximum clock skew. The most
time-efficient of the software algorithms that we know of is the interactive convergence 
algorithm [23]. 

In the interactive convergence algorithm, each processor in the system determines 
its skew relative to every other processor in the system. If any relative skew is greater 
than a predetermined threshold, it is set to zero. An average of all the relative skews is 
calculated and used to correct its clock. 
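
The correction step just described can be sketched in a few lines. This is a minimal illustration rather than the SIFT implementation: it assumes that each relative skew is expressed as "other clock minus own clock", that the list includes the processor's own entry (zero), and that the resulting average is added to the local clock. The name `clock_correction` is hypothetical.

```python
def clock_correction(relative_skews, threshold):
    """Average the relative skews after zeroing any whose magnitude exceeds the threshold."""
    clipped = [s if abs(s) <= threshold else 0.0 for s in relative_skews]
    return sum(clipped) / len(clipped)
```

For example, with skews `[0.0, 2.0, -1.0, 50.0]` and a threshold of `10.0`, the outlying reading of 50 is discarded (set to zero) and the correction is 0.25. Zeroing rather than dropping the outlier keeps the divisor at the full cluster size, which limits how far a single faulty clock can pull the average.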

The following theorem (a trivial adaptation of one proved in [23]) characterizes the 
maximum clock skew of the system in terms of the following system parameters. 

$\varepsilon$ - maximum error in reading another processor's clock.

$\rho$ - maximum drift rate between any two clocks in the system.

$N$ - number of clocks in the system.

$m$ - maximum number of faulty clocks accommodated.

$R$ - resynchronization period.

$S(N)$ - execution time of the resynchronization task.

$\delta_0$ - maximum clock skew at start-up.

Theorem 3 (adapted from [23]): If the following conditions hold:

$3m<N$

$\delta \ge \left[1-\frac{3m}{N}-2\rho\left(1-\frac{m}{N}\right)\right]^{-1}\left[2\varepsilon+\rho\left\{R+2\,\frac{N-m}{N}\,S(N)\right\}\right]$

$\delta \ge \delta_0+\rho R$

$\max(\delta, S(N)) < R$

$\rho\delta \ll \varepsilon$

then, the non-faulty clocks remain in synchrony, i.e. the maximum skew is $\delta$.

The synchronization algorithm is run periodically, the major component of the exe- 
cution time usually being the time required to read every other processor’s clock in the 
system. In the SIFT system, each processor’s clock value is broadcast during a window 
of time allocated to it. There are N such windows, one for each processor in the system. 
All other processors wait during this window to receive the broadcast data value. 

In order to accommodate the worst-case situation, each window must be at least
$B+\delta$ long. The interactive convergence algorithm takes an execution time equal to
$S(N)=N(B+\delta)+K$, where $K$ is the time needed to compute and carry out the clock
correction.

It should be noted that this execution time of the synchronization task affects the
synchronization process itself. Indeed, since this is a function of $N$, there is a maximum
cluster size that can be synchronized in this way. To see this, substitute the above
expression for $S(N)$ in the formula for $\delta$, and obtain:

$$\delta \ge N\left[N-3m-2\rho(N^2+N-mN-m)\right]^{-1}\left[2\varepsilon+\rho\left\{R+2(N-m)\left(B+\frac{K}{N}\right)\right\}\right] \tag{48}$$

From this, one can (a) compute the minimum execution time of the synchronization task
as a function of the cluster size, (b) obtain the quality of synchronization (the smaller
the $\delta$, the better the synchronization), and (c) determine the largest possible cluster that
can be thus synchronized: this is the largest $N$ for which $S(N) < R$.
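
The three uses listed above can be explored numerically. The sketch below is an illustration only: it evaluates the right-hand side of the skew bound with $S(N)=N(B+\delta)+K$ already substituted, takes the smallest admissible $\delta$, and scans for the largest $N$ satisfying $\max(\delta, S(N)) < R$. All function names are hypothetical, and any parameter values passed in are the caller's assumptions, not SIFT measurements.

```python
def skew_bound(N, m, eps, rho, R, B, K):
    """Smallest skew allowed by the bound above; None if the bound cannot be met."""
    denom = N - 3 * m - 2 * rho * (N * N + N - m * N - m)
    if denom <= 0:
        return None                      # no finite skew satisfies the inequality
    return N / denom * (2 * eps + rho * (R + 2 * (N - m) * (B + K / N)))


def sync_time(N, delta, B, K):
    """Execution time S(N) = N (B + delta) + K of the resynchronization task."""
    return N * (B + delta) + K


def largest_cluster(m, eps, rho, R, B, K, n_max=1000):
    """Largest N in [3m+1, n_max] with max(delta, S(N)) < R, or None if there is none."""
    best = None
    for N in range(3 * m + 1, n_max + 1):
        delta = skew_bound(N, m, eps, rho, R, B, K)
        if delta is not None and max(delta, sync_time(N, delta, B, K)) < R:
            best = N
    return best
```

As a sanity check on the formula, with zero drift (`rho = 0`) the bound reduces to $2\varepsilon N/(N-3m)$, so `skew_bound(4, 1, 1.0, 0.0, 100.0, 1.0, 0.0)` evaluates to 8.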

The values for the SIFT system are given by $\varepsilon=18.2$ micro-seconds, and execution
time for the synchronization task is 1.760 milli-seconds. Numerical results on the
synchronization overhead using these values are plotted in Figure 26. The maximum cluster
size permissible for synchronization is tabulated in Table 6.

Although the expression $S(N) = N(\delta+B)+K$ was presented as emanating from the
SIFT system, it is easy to see that in any system where communication is by broadcast, 
and clock transmission slots are pre-determined, this expression will hold. It should also 
be reiterated that such communication protocols are the most commonly used protocols 
in real-time systems. In any case, it is obvious that whatever the protocol used, S(N) is 
very unlikely to be less than a first order function of N. 

Even if, in a hypothetical case, $S(N)$ were negligible (which, of course, can never
happen but nevertheless represents an extreme case), $\delta$ will continue to be a function of
$N$, and there will be a point for which $\delta>R$, at which synchrony will break down.

5.4. Voting and Byzantine Generals Algorithm 

5.4.1. Voting 

Once the delay involved in synchronization is taken account of, there is very little
additional delay if the voting is carried out in hardware. With software voting, however,
the additional overhead can be significant.

Figure 26. Software Synchronization Overhead.

34
5 × 10^-4
1 × 10^-3
16

** SIFT Value

Table 6. Maximum Cluster Size Permissible for Software Synchronization


Voting in software is carried out by individual processors placing data to be voted
on in pre-specified "mailboxes" or "pigeonholes". The voter searches the mailboxes for
valid data, fetches them, and then votes on them. The execution time in the retrieval
step is directly proportional to the number of processors in the cluster, $N$. The execution
time required to vote $N$ values and diagnose up to $m$ faults is at least $(N-1)C_1 + C_2$ but
less than $[(N-1) + 2m - 2]C_1 + C_2 = [\frac{5}{3}(N-1) - 2]C_1 + C_2$, where $C_1$ and $C_2$ are some
constants [24].
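
The upper bound above equates $[(N-1)+2m-2]$ with $[\frac{5}{3}(N-1)-2]$. A short check of that step, assuming the minimum cluster size $N = 3m+1$ (an assumption made here for illustration; the text does not state it at this point):

```latex
\begin{align*}
m &= \frac{N-1}{3}, \\
(N-1) + 2m - 2 &= (N-1) + \tfrac{2}{3}(N-1) - 2 = \tfrac{5}{3}(N-1) - 2 .
\end{align*}
```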

Experimental data exist for 3-MR and 5-MR in SIFT. These data can easily be
introduced into the linear model obtained above. If $s$ is the number of data values
voted, and $V_N(s)$ the time taken for an $N$-way vote on $s$ data values, the following
expression was found to hold for SIFT:

$$V_N(s) \approx 58.5\,sN + 91.5\,s + 38 \text{ micro-seconds.} \tag{49}$$

This is a large overhead: in SIFT, for example, voting is performed at the beginning of a
3.2 milli-second subframe. If $s=6$, $N=5$, then 73% of the subframe is consumed by the
voting algorithm [25].
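
The 73% figure follows directly from Eq. (49). A minimal sketch of the arithmetic, with hypothetical function names and the 3.2 milli-second subframe taken from the text:

```python
SUBFRAME_US = 3200.0   # 3.2 milli-second subframe, in micro-seconds


def voting_time_us(s, N):
    """Eq. (49): time for an N-way vote on s data values, in micro-seconds."""
    return 58.5 * s * N + 91.5 * s + 38.0


def subframe_fraction(s, N, subframe_us=SUBFRAME_US):
    """Fraction of the subframe consumed by the vote."""
    return voting_time_us(s, N) / subframe_us
```

For $s=6$ and $N=5$ the vote takes 2342 micro-seconds, i.e. about 0.73 of the subframe.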

5.4.2. Byzantine Generals Algorithms 

The Byzantine Generals, or interactive consistency, algorithm must be used when it
is necessary to isolate the sources of errors as well as to mask the errors themselves. It
finds use when reconfiguration upon failure is to be attempted and the executive is
distributed. The algorithm takes into account the fact that faulty processors may be
malicious, in other words, that they need not fail only in "safe" directions. To be absolutely
certain that faulty processors can be properly identified for isolation, it is necessary to
allow for every possible misbehavior: thus the case in which a faulty processor is malicious,
i.e. actively and intelligently attempts to hide its malfunction, must also be handled.
Such algorithms are typically used to reach agreement between processors in a cluster
on incoming sensor data, and in certain clock synchronization algorithms. For
further details, see [26-28].

The input of data is accomplished by every processor reading the external sources 
independently or by one processor reading the external sources and then distributing the 
obtained value to the rest of the processors. In the first case, each processor would very 
probably get a different value — even if they were in perfect synchrony — due to the 
inherent instability in reading analog data. Hence, a subsequent exchange of values read 
along with a mid-value selection is required to get a consistent value. However, this pro- 
cess suffers from sensitivity to malicious faulty processors and interactive consistency (or 
Byzantine Generals) algorithms are essential where fault isolation and reconfiguration 
are required. 

The interactive consistency algorithm consists of the following steps: 

(1) The source value is distributed to the N processors. 

(2) The received values are exchanged m times to handle up to m faulty processors. 

(3) A consistent value is obtained by use of a recursive algorithm. When $m=1$, this
reduces to a majority calculation.

The overhead for these interactive consistency algorithms can be considerable. $N$
must be at least $3m+1$. The number of messages required to obtain interactive
consistency is of the order of $N^{m-1}$. To give an idea of the actual numbers incurred in
practice, some experimental results from the SIFT computer [25] are used.

In SIFT, with five-way voting, only one fault can be located. The simple
flight-control applications currently running in SIFT use 63 external sensor values, each of
which goes through the interactive consistency algorithm. From the data collected,
execution times for steps (1) and (2) of the algorithm can be estimated, and a lower bound
determined for step (3). The following data were measured: step (1): 3.05 ms, step (2):
2.22 ms, and step (3): 6.57 ms (total 11.84 ms). For larger $m$, the step (1) execution
time should not change significantly, while the step (3) calculation would require at least
6.57 ms (very likely much more). The step (2) process consists of only message exchanges
and thus varies directly with the number of messages which are sent. The following
formula represents an approximate execution time for step (2) as a function of $m$:
$2.22\,N^{m-1}$ ms. We may add the timing values for steps (1) and (3) above to this
expression to obtain a lower bound for the overhead of the Byzantine Generals
algorithm in SIFT. Since the interactive consistency tasks must be executed at the data
sample rate, a large portion of the available CPU time is consumed: see Table 7.
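
The lower bound described above is simple to evaluate. The sketch below is illustrative only: it adds the measured step (1) and step (3) times to the step (2) estimate $2.22\,N^{m-1}$ ms and expresses the total as a fraction of the data sample period. The function names are hypothetical, and the choice of cluster size $N$ for a given $m$ is left to the caller (for example the minimum $N=3m+1$).

```python
STEP1_MS = 3.05          # measured in SIFT, step (1)
STEP2_COEFF_MS = 2.22    # measured in SIFT for m = 1, step (2)
STEP3_MS = 6.57          # measured lower bound, step (3)


def overhead_ms(N, m):
    """Lower bound on the interactive consistency overhead, in milli-seconds."""
    return STEP1_MS + STEP2_COEFF_MS * N ** (m - 1) + STEP3_MS


def cpu_fraction(N, m, sample_period_ms):
    """Fraction of each sample period consumed by the algorithm."""
    return overhead_ms(N, m) / sample_period_ms
```

With $m=1$ the bound is 11.84 ms regardless of $N$, i.e. about 11.8% of a 100 ms sample period; with $m=2$ and $N=7$ it is 25.16 ms, or roughly a quarter of a 100 ms period.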

These results indicate the extremely high overhead imposed in an attempt to 
achieve interactive consistency. It should be pointed out that there have lately been 
some more efficient implementations of the Byzantine Generals algorithm [29] than have 
been implemented on SIFT. However, even such implementations exhibit high overheads 
as the number of faulty modules to be accommodated increases. 

5.5. Reconfigurable and Non-reconfigurable Systems

To locate faults after they have been detected by voting, diagnostic tests must be 
run. To ensure agreement amongst all non-faulty processors about the results of the 
tests, the interactive consistency algorithm must be executed. 

Unfortunately, as we have seen, this algorithm is extremely time-consuming to run. 
Reconfigurable systems must therefore contend with a large overhead as compared to 
non-reconfigurable systems.

Data Sample Period    m=1       m=2       m=3
100 ms                11.8 %    25.1 %    >380 %
50 ms                 23.7 %    50.2 %    >760 %
33 ms                 35.9 %    76.2 %    >1140 %
25 ms                 47.4 %    >100 %    >1520 %

m = number of faulty processors accommodated

Table 7. Overhead of Byzantine Generals Algorithm: Lower Bound for SIFT

However, reconfigurable systems have the advantage of dynamic redundancy
management. When widespread failures occur, it is possible to retire some clusters in 
order to keep others at full strength. Also, by periodically purging itself of faulty com- 
ponents, a reconfigurable system can survive in the face of more failures than can a 
non-reconfigurable system. For example, if one started operation with a 7-cluster, the 
reconfigurable system would not fail unless either (a) all but two processors fail, or (b) 
more than m (m= 2 for a 7-cluster, and 1 for a 4-cluster) processors fail between succes- 
sive tests, while the corresponding non-reconfigurable system would fail if more than 3 
processors failed. This does not automatically mean that a reconfigurable system is 
necessarily better than a non-reconfigurable one, since as we shall see, timing require- 
ments impose severe constraints on the size of reconfigurable clusters. 

We shall contrast the reliability of reconfigurable and non-reconfigurable systems 
with the following example. Assume that there is a single critical task in the system that 
requires 1.6 milli-seconds to run, and that this task is dispatched every 50 milli-seconds, 
and that the system must be ready to begin executing the task the moment it is 
released. There is a total of N processors available. Processors fail according to an 
exponential law with specified MTBF. The mission lifetime (duration between successive 
service stages) is also specified. 

In the following sections, we consider non-reconfigurable and reconfigurable systems 
separately. In both cases, we assume that synchronization is by means of phase-locked 
clocks. Since these can be made arbitrarily reliable and are common to both reconfigur- 
able and non-reconfigurable systems, we do not consider the probability of clock failure 
in what follows. Numerical results in the section on reconfigurable systems are based on 
the lower bounds obtained from SIFT. Processor failures are assumed to occur
independently, forming a Poisson process with mean interarrival time $5 \times 10^5$ seconds.

5.5.1. Non-Reconfigurable System 

Under the above assumptions, the probability of dynamic failure is simply equal to
the probability of static failure, i.e. the probability that more than $\lfloor N/2 \rfloor$
processors fail over the mission lifetime. This probability is graphed in Figure 27.
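
The static failure probability can be sketched numerically as follows. This is a minimal illustration, not the computation behind Figure 27: it reads static failure as more than floor(N/2) failed processors (matching the earlier 7-processor example, which fails once more than 3 processors have failed), treats processors as independent with exponentially distributed lifetimes, and uses a hypothetical 10-hour mission lifetime, since the text leaves the mission lifetime as a parameter. The function names are illustrative.

```python
import math

MTBF_S = 5e5  # mean time between processor failures, in seconds (from the text)


def processor_failure_prob(mission_s, mtbf_s=MTBF_S):
    """Probability that a single processor fails during the mission (exponential law)."""
    return 1.0 - math.exp(-mission_s / mtbf_s)


def static_failure_prob(n, mission_s, mtbf_s=MTBF_S):
    """Probability that more than floor(n/2) of n independent processors fail."""
    p = processor_failure_prob(mission_s, mtbf_s)
    threshold = n // 2
    # Binomial tail: sum over every count of failed processors above the threshold.
    return sum(
        math.comb(n, j) * p**j * (1.0 - p) ** (n - j)
        for j in range(threshold + 1, n + 1)
    )


if __name__ == "__main__":
    mission_s = 10 * 3600.0  # hypothetical mission lifetime, not taken from the text
    for n in (3, 5, 7, 9):
        print(n, static_failure_prob(n, mission_s))
```

Under these assumptions the probability falls as N grows, which is the trend the text attributes to the non-reconfigurable system.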

5.5.2. Reconfigurable System 

We assume here that the interactive consistency algorithm will only be invoked
when a vote detects processor failure. The problem of replicating simplex data
amongst multiple processors is not treated here: it is assumed that the slight variations
in analogue sensor data obtained without the Byzantine algorithm are acceptable. The
purpose of the interactive consistency algorithm here is to obtain agreement on
diagnostic tests. The assumptions are that the tests have 100% coverage, and for convenience,
that the diagnostics take 3 ms. Clearly, the diagnostic period is insensitive to the value
of m. It is not difficult to alter the analysis to allow for a relaxation of these
assumptions. Doing so may alter the numerical values presented, but will not change the
qualitative nature of these results.

The execution time is bounded below by $2.22\,N^{m-1} + 9.6$ milli-seconds. Since the
task execution time is 1.6 milli-seconds, an unreplenishable reserve of 50 - 1.6 = 48.4
milli-seconds of time is available. If the overhead is smaller than this, the probability of
dynamic failure is equal to the probability of hardware failure: if not, it is equal to unity.
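
The comparison of overhead against the reserve of time can be checked with a few lines of Python. This sketch reads the lower bound as 2.22 N^(m-1) + 9.6 milli-seconds, a reading that reproduces the 50.3% overhead quoted for the 7-processor cluster (25.14 ms out of 50 ms). The 10-processor case for m=3 is an assumption based on the usual 3m+1 sizing of Byzantine-resilient clusters, which agrees with the 4- and 7-processor clusters named in the text. The function names are illustrative.

```python
PERIOD_MS = 50.0  # inter-dispatch interval (from the text)
TASK_MS = 1.6     # execution time of the critical task (from the text)


def overhead_lower_bound_ms(n, m):
    """Lower bound on the overhead, read as 2.22 * N^(m-1) + 9.6 milli-seconds."""
    return 2.22 * n ** (m - 1) + 9.6


def fits_in_reserve(n, m, task_ms=TASK_MS, period_ms=PERIOD_MS):
    """True if the overhead is smaller than the unreplenishable reserve of time."""
    return overhead_lower_bound_ms(n, m) < period_ms - task_ms


if __name__ == "__main__":
    for m in (1, 2, 3):
        n = 3 * m + 1  # assumed cluster size for tolerating m faulty processors
        print(n, m, overhead_lower_bound_ms(n, m), fits_in_reserve(n, m))
```

With the 48.4 milli-second reserve, the clusters with m=1 and m=2 fit and the cluster with m=3 does not.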

As may be seen from a simple calculation, for clusters with m>2, the overhead
exceeds the reserve of time, so that the maximum allowed size of the cluster is N=7,
m=2. For this reason, if we start with more than 7 processors, the additional processors
will have to be on stand-by for inclusion upon a failure within the cluster. Failure occurs
if either during a single execution more than m failures occur in the cluster, or the
processor pool is exhausted, i.e. if there is an insufficient number of processors left to make
up a cluster.

Figure 27(a)-(c). Comparison of Reconfigurable and Non-Reconfigurable Systems
(vertical axis: Probability of Dynamic Failure)

The reconfiguration policy is simple. The system begins operation with either a
7-MR or a 4-MR cluster (depending on the value of N). As processors fail, they are
replaced if spares are available. If the stock of spares is exhausted, further failures are
handled by the 7-cluster reconfiguring into a 4-MR cluster. If N<7, only a single failure
can be tolerated. The policy is thus a combination of hybrid and adaptive voting.

Numerical results for the probability of failure of reconfigurable systems are plotted
in Figure 27 for a ready comparison with their non-reconfigurable counterparts.

It is apparent from Figure 27 that while increasing the number of available
processors in a non-reconfigurable system reduces its probability of failure, there is a lower
bound to the probability of failure for reconfigurable systems. This bound arises because
the cluster size is limited to 7, since the overhead exceeds 100% for larger
clusters. There is therefore a point after which the probability of more than m processor
failures over a single execution (aggregated over the mission lifetime) becomes the
dominant component in the probability of failure. As one might expect, the reconfigurable
system performs better than the non-reconfigurable system when the mission lifetime is
longer. This advantage comes, in this case, at the horrible price of a 50.3% overhead.
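
The origin of this lower bound can be illustrated with a rough calculation. The sketch below is not the analysis behind Figure 27. It assumes that the cluster is at full strength (N=7, m=2) at the start of every 50 milli-second interval, that the diagnostic tests occur once per interval, and that the mission lifetime is a hypothetical 10 hours. The function names are illustrative.

```python
import math

MTBF_S = 5e5        # mean time between processor failures, in seconds (from the text)
INTERVAL_S = 0.050  # inter-dispatch interval, in seconds (from the text)


def per_interval_prob(n=7, m=2, interval_s=INTERVAL_S, mtbf_s=MTBF_S):
    """Probability that more than m of n processors fail within one interval."""
    # 1 - exp(-t/MTBF), computed with expm1 to keep precision for tiny arguments.
    p = -math.expm1(-interval_s / mtbf_s)
    return sum(
        math.comb(n, j) * p**j * (1.0 - p) ** (n - j)
        for j in range(m + 1, n + 1)
    )


def floor_over_mission(mission_s, n=7, m=2):
    """Probability that at least one interval of the mission sees more than m failures."""
    intervals = mission_s / INTERVAL_S
    q = per_interval_prob(n, m)
    # 1 - (1 - q)^intervals, computed stably; close to intervals * q when q is tiny.
    return -math.expm1(intervals * math.log1p(-q))


if __name__ == "__main__":
    mission_s = 10 * 3600.0  # hypothetical mission lifetime, not taken from the text
    print(per_interval_prob(), floor_over_mission(mission_s))
```

Because this term does not depend on the number of spare processors, adding processors beyond the cluster size cannot push the probability of failure below it.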

Clearly, the results in Figure 27 are problem-specific. Indeed, they are critically
dependent on the length of the inter-dispatch interval, the unreplenishable reserve of
time that is available, and the time taken to execute the Byzantine algorithm. Also,
while the reconfigurable system may appear to be the better performer in Figure 27, this
is largely due to the large reserve of time available. Suppose that the task, instead of
taking a maximum of 1.6 milli-seconds to perform, took 35 milli-seconds. Then, the
reserve of time is 50 - 35 = 15 milli-seconds, and the largest reconfigurable cluster that
could fit in this reserve is N=4, m=1. In Figure 28, we display results for such a task.
Naturally, the reconfigurable system comes out much more poorly here.
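
The arithmetic behind this choice of cluster can be written out as follows, reading the lower bound on the overhead as $2.22\,N^{m-1} + 9.6$ milli-seconds (the reading that reproduces the 50.3% overhead quoted above for N=7, m=2):

```latex
\begin{align*}
\text{reserve} &= 50 - 35 = 15 \text{ ms}, \\
N = 4,\ m = 1: \quad 2.22 \cdot 4^{0} + 9.6 &= 11.82 \text{ ms} < 15 \text{ ms}, \\
N = 7,\ m = 2: \quad 2.22 \cdot 7^{1} + 9.6 &= 25.14 \text{ ms} > 15 \text{ ms}.
\end{align*}
```

Only the 4-processor cluster fits within the reduced reserve, so only a single failure between successive tests can be tolerated.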

6. CONCLUSION 

In this report, we have characterized real-time computers by (i) introducing new 
performance measures for computers used in the control of critical processes, and (ii) 
applying the measures to the design and analysis of real-time computers.

Studying the behavior of the controlled system as a function of the computer 
response time provides a means for the effective design of computer controllers in the 
context of controlled processes. As we saw in the examples in Sections 3 and 4, this 
includes such things as control policy. This means that while the cost function is defined 
explicitly in terms of the controller response time, all facets of the controlled process are 
implicitly included in the calculations. 

Owing to the objectivity of the cost functions, they can be used with some confidence
as optimization criteria for the design of real-time control computers (architecture and
operating system design). The probability of dynamic failure is to be used as a pass-fail
criterion, with comparison of rival systems on the basis of the mean cost limited to
systems exhibiting an acceptably low probability of dynamic failure. The inclusion of
hardware or life-cycle costs into the analysis is also possible, as indicated in Section 5.

In addition to the applications treated in Section 5, the performance measures are 
useful as criterion functions in the following areas: 
(1) Optimal placement of checkpoints for backward error recovery.

(2) Optimal task allocation and reallocation strategies. 

(3) Optimal control of queues at shared resources. 

(4) Optimal routing policies at interconnection networks. 

(5) Measuring sensitivity of reliability and operating overhead to the redundancy and
bandwidth of the interconnection links.

(6) Optimal event-handling and time scheduling. 

The design procedures used at present for control computers are ad hoc, principally
because of the lack of adequately objective means for the characterization of controller
performance. The expression, through a scalar metric, of the performance of the
controller in the context of the process it is controlling is important in making the
controller design and evaluation process systematic. All facets of the controller-controlled
process relationship are taken into account in the performance measures here presented:
while the measures are explicitly functions of the response time of the controller, they
are implicitly functions of the characteristics of the controlled process. It is this fusion of
controller and controlled process characteristics that is novel, and that distinguishes the
work presented in this report from that of others. The chief utility of these measures is
also derived from this accounting of the synergistic coupling between controller and
controlled process.
APPENDIX: EXPRESSIONS FOR FINITE COST FUNCTION


Finite cost functions for Example 2 in Section 2.4 can be expressed by an if-then-else
construct as follows:

if $x_{1i} x_{2i} > 0$ then
    if $|x_{2i}| > k$ then
        $g(x_i, \xi) = \frac{1}{2} [\, t_1(x_i, k, \xi) + t_1(x_i, -k, \xi) \,]$
    else
        $g(x_i, \xi) = \frac{1}{2} [\, t_1(x_i, \mathrm{sgn}(x_{2i})k, \xi) + t_2(x_i, -\mathrm{sgn}(x_{2i})k, \xi) \,]$
else
    if $|x_{2i}| > k$ then
        $g(x_i, \xi) = \frac{1}{2} [\, t_2(x_i, k, \xi) + t_2(x_i, -k, \xi) \,]$
    else
        $g(x_i, \xi) = \frac{1}{2} [\, t_1(x_i, -\mathrm{sgn}(x_{2i})k, \xi) + t_2(x_i, \mathrm{sgn}(x_{2i})k, \xi) \,]$
end if;

where $a = H/m$, $y(x_{2i}, k) = x_{2i} + k$, $x_i = (x_{1i}, x_{2i})^T$,


$\tau(x_i, k, \xi) = \dfrac{2a\,|x_{1i}| - y^2(x_{2i}, k) - 2a\,|y(x_{2i}, k)|\,\xi}{4a\,y(x_{2i}, k)}$,

$t_1(x_i, k, \xi) = \xi + \dfrac{|y(x_{2i}, k)\,(1 + \sqrt{2})|}{a}$,
$t_2(x_i, k, \xi) =$


£+t{x ,k £) | \ Xli \ " \ y ( X2i ’ k ) \ U+*(Xi,k,Q) _ 

* , y 2 (x 2 i,k) , 0 /y 2 (x 2 „k) I Xjj I - I y(x 2i ,k) J { 

+ W~ ; 


if $\xi < \dfrac{2a\,|x_{1i}| - y^2(x_{2i}, k)}{2a\,|y(x_{2i}, k)|}$

otherwise
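
The branch selection in the if-then-else construct at the start of this appendix can be sketched in Python as follows. This is one reading of the scanned construct, in particular of which branches use t_1 and which use t_2, and it should be checked against the original report. The closed forms of t_1 and t_2 are not reproduced here: they are passed in as callables, and the names `finite_cost`, `sgn`, `t1` and `t2` are illustrative.

```python
from typing import Callable

# A time function takes (x1, x2, k, xi) and returns a time.
TimeFn = Callable[[float, float, float, float], float]


def sgn(v: float) -> int:
    """Sign function: -1, 0 or +1."""
    return (v > 0) - (v < 0)


def finite_cost(x1: float, x2: float, k: float, xi: float,
                t1: TimeFn, t2: TimeFn) -> float:
    """Average of two time functions, selected by the region containing (x1, x2)."""
    if x1 * x2 > 0:
        if abs(x2) > k:
            return 0.5 * (t1(x1, x2, k, xi) + t1(x1, x2, -k, xi))
        s = sgn(x2) * k
        return 0.5 * (t1(x1, x2, s, xi) + t2(x1, x2, -s, xi))
    if abs(x2) > k:
        return 0.5 * (t2(x1, x2, k, xi) + t2(x1, x2, -k, xi))
    s = sgn(x2) * k
    return 0.5 * (t1(x1, x2, -s, xi) + t2(x1, x2, s, xi))
```

Passing t_1 and t_2 as arguments keeps the branch logic separate from the time expressions, so that either can be corrected independently.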